Refresh schema file Dataset in recipe Python

Hazou
Hazou Registered Posts: 7 ✭✭✭✭

Hello,

I have many dataset from csv files and every month i have new files (new files replace previous files), i develop a python program to control name columns.

I put a new file with a new column or an other name column to test my code (example : years instead of year), when i run my recipe python, i have name column from my previous file.

test = dataiku.Dataset("test")
test_df = test.get_dataframe()

How can i fix that in my recipe python ? is there function to refresh schema ?

Thank you

Tagged:

Answers

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 320 Neuron
    edited July 17

    Hi @Hazou
    ,

    I may not be completely following what you are asking, let me know if I'm misunderstanding.

    My understanding is that each month you are populating the same output datasets from text files that may have different column names. You want to be able to update the column names in the output datasets to the new names.

    First, you may be able to use the dataset write_with_schema method to populate the output datasets and overwrite the schema in the same step. This uses the columns from the passed dataframe to set the dataset schema. This assumes you are reading the text files into pandas dataframes.

    If you aren't reading text files into dataframes or you want more control over the dataset schema, you can also update it directly using the write_schema method. You pass a list of dicts to this method. Following is an example. This assumes "test" is your output dataset and that you build the list of dicts within your Python script.

    test = dataiku.Dataset("test")
    schema = [{'name': 'col1', 'type': 'int'}, {'name': 'col2', 'type': 'string'}]
    test.write_schema(schema)

    You can find the list of valid data types in DSS UI by looking at an existing dataset via Settings and then Schema. The types drop down shows possible data types. Common ones are int, string, date, float.

    If you just want to modify a few columns and thus start with the existing schema, you can get that by a get_config method call: test.get_config()['schema']['columns']

    Here is the relevant documentation: https://doc.dataiku.com/dss/latest/python-api/datasets-reference.html

    Hope this is helpful.

    Marlan

  • Hazou
    Hazou Registered Posts: 7 ✭✭✭✭

    Thank you Marlan for your reply.

    it's not exactly my problem.

    Indeed, I receive the same file every month. I want to check that the column names are ok.

    Step by Step :

    1. I put the file (test.csv) with the good name column on my server : (id, year, sale)

    2. I create a new dataset from this file

    3. I run my python program, everything is ok because i have the good name (id, year, sale)

    4. I change a column from my file year with s (id, years, sale)

    5. I put this file (test.csv) on my server with the same name

    6. I run my python program, normaly i must have an error but no.

    When i look my dataset, i have always the previous name.

    My program python :

    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    import datetime
    import sys

    Flux_Controle = pd.DataFrame(columns=['Dataset','Date_controle', 'Column'])
    controle_list = []

    def controleFichier(name, Colonnes) :
    controle = dataiku.Dataset(name)
    controle_df = controle.get_dataframe()

    list_verify= pd.Series(Colonnes)
    list_Flux = controle_df.columns.tolist()
    print("verif")
    print(list_verify)
    print("flux")
    print(list_Flux)
    if set(list_Flux) == set(list_verify) :
    pass
    else:
    controle_list.append('ko')
    list_difference = list(set(list_verify)-set(list_Flux)) + list(set(list_Flux)-set(list_verify))
    row = [name, datetime.date.today(), list_difference]
    Flux_Controle.loc[len(Flux_Controle)] = row


    controleFichier("test_1",['id','year','sale'])


    # WRITE OUTPUT

    test1 = dataiku.Dataset("Flux_Controle")
    test1.write_with_schema(Flux_Controle)
    controle_list = set(controle_list)
    if 'ko' in controle_list :
    raise ValueError('Problem')

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 320 Neuron

    Hi @Hazou
    ,

    So the problem is that the dataset Flux_Controle doesn't include a row that shows the difference in columms (i.e. between year and years) when it should? Sorry, still not sure I'm following... as that wouldn't relate to schemas.

    Marlan

  • Hazou
    Hazou Registered Posts: 7 ✭✭✭✭

    Yes the problem is that the dataset Flux_Controle doesn't include a row that shows the difference in columms (i.e. between year and years).

    When i open the dataset from the flow , i have always thé previous name (year), when i check the schema, nothing update.

    I need to refresh schema, with python or from scenario (i tried build dataset).

    thank you

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 320 Neuron

    Hi @Hazou
    ,

    You say, when you open the dataset from the flow you don't see the updated schema. Which dataset are you referring to? I assume it's "test_1" not "Flux_Controle"?

    If so then the fact that Flux_Controle doesn't include a row that shows the difference in columns is a symptom of the underlying problem that "test_1" doesn't have an updated schema.

    How do you create the "test_1" dataset? Do you use a Python recipe? If so, can you share that code?

    Marlan

  • Hazou
    Hazou Registered Posts: 7 ✭✭✭✭

    Hello @Marlan

    I don't use recipe python to create test_1.

    Test_1 = import file test.csv from HDFS.

    1. I put the file (test.csv) with the good name column on my server : (id, year, sale)

    2. I create a new dataset from this file (dataset = test_1)

    3. I run my python program, everything is ok because i have the good name (id, year, sale)

    4. I change a column from my file year with s (id, years, sale)

    5. I put this file (test.csv) on my server with the same name

    6. I run my python program, normaly i must have an error but no

    When i open the dataset from the flow, i have always year.

  • Marlan
    Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 320 Neuron

    Hi @Hazou
    ,

    OK, I think I'm understanding what you are doing better...

    Check out this solution - the problem posted appears to be similar to yours: https://community.dataiku.com/t5/Using-Dataiku-DSS/Refresh-read-schema-on-dataset-via-API/m-p/8730/highlight/true#M4405

    Does this help?

    Marlan

Setup Info
    Tags
      Help me…