Refresh a dataset's schema in a Python recipe

Hazou

Hello, 

I have many datasets built from CSV files, and every month new files replace the previous ones. I developed a Python program to check the column names.

To test my code, I added a new file with a new column or a renamed column (for example, years instead of year). When I run my Python recipe, I still get the column names from my previous file.

# Read the dataset into a pandas dataframe
test = dataiku.Dataset("test")
test_df = test.get_dataframe()

How can I fix that in my Python recipe? Is there a function to refresh the schema?

Thank you  

 

Marlan

Hi @Hazou,

I may not be completely following what you are asking; let me know if I'm misunderstanding.

My understanding is that each month you are populating the same output datasets from text files that may have different column names. You want to be able to update the column names in the output datasets to the new names.

First, you may be able to use the dataset write_with_schema method to populate the output datasets and overwrite the schema in the same step. This uses the columns from the passed dataframe to set the dataset schema. This assumes you are reading the text files into pandas dataframes.
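For example, a minimal sketch (the file path and dataset name below are placeholders, not your actual ones):

import dataiku
import pandas as pd

# Hypothetical path and names, just for illustration
df = pd.read_csv("/path/to/test.csv")

# write_with_schema writes the rows and sets the output dataset's
# schema from the dataframe's columns in one step
output = dataiku.Dataset("test")
output.write_with_schema(df)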

If you aren't reading text files into dataframes or you want more control over the dataset schema, you can also update it directly using the write_schema method. You pass a list of dicts to this method. Following is an example. This assumes "test" is your output dataset and that you build the list of dicts within your Python script.

test = dataiku.Dataset("test")
# Overwrite the dataset schema with the new column definitions
schema = [{'name': 'col1', 'type': 'int'}, {'name': 'col2', 'type': 'string'}]
test.write_schema(schema)

 

You can find the list of valid data types in the DSS UI by looking at an existing dataset via Settings and then Schema. The type dropdown shows the possible data types; common ones are int, string, date, and float.

If you just want to modify a few columns and thus start with the existing schema, you can get it with a get_config method call: test.get_config()['schema']['columns']
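For example, something along these lines (assuming you want to rename a single column; the dataset and column names are illustrative):

test = dataiku.Dataset("test")

# Start from the existing schema (a list of dicts with 'name' and 'type')
columns = test.get_config()['schema']['columns']

# Rename one column, leaving the rest untouched
for col in columns:
    if col['name'] == 'years':
        col['name'] = 'year'

test.write_schema(columns)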

Here is the relevant documentation: https://doc.dataiku.com/dss/latest/python-api/datasets-reference.html

Hope this is helpful.

Marlan

Hazou

Thank you, Marlan, for your reply.

That's not exactly my problem, though.

I receive the same file every month, and I want to check that the column names are correct.

Step by step:

1. I put the file (test.csv) with the correct column names on my server: (id, year, sale)

2. I create a new dataset from this file

3. I run my Python program; everything is OK because I have the correct names (id, year, sale)

4. I rename a column in my file by adding an s to year (id, years, sale)

5. I put this file (test.csv) on my server with the same name

6. I run my Python program; normally I should get an error, but I don't

When I look at my dataset, I still see the previous column names.

My Python program:

import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import datetime
import sys

# Dataframe collecting one row per dataset whose columns don't match
Flux_Controle = pd.DataFrame(columns=['Dataset', 'Date_controle', 'Column'])
controle_list = []

def controleFichier(name, Colonnes):
    # Read the dataset and compare its columns against the expected list
    controle = dataiku.Dataset(name)
    controle_df = controle.get_dataframe()

    list_verify = pd.Series(Colonnes)
    list_Flux = controle_df.columns.tolist()
    print("verif")
    print(list_verify)
    print("flux")
    print(list_Flux)
    if set(list_Flux) == set(list_verify):
        pass
    else:
        # Record the mismatch: columns missing on either side
        controle_list.append('ko')
        list_difference = list(set(list_verify) - set(list_Flux)) + list(set(list_Flux) - set(list_verify))
        row = [name, datetime.date.today(), list_difference]
        Flux_Controle.loc[len(Flux_Controle)] = row


controleFichier("test_1", ['id', 'year', 'sale'])


# WRITE OUTPUT

test1 = dataiku.Dataset("Flux_Controle")
test1.write_with_schema(Flux_Controle)
controle_list = set(controle_list)
if 'ko' in controle_list:
    # Fail the recipe so the mismatch is visible in the flow
    raise ValueError('Problem')

 

Marlan

Hi @Hazou,

So the problem is that the dataset Flux_Controle doesn't include a row that shows the difference in columns (i.e. between year and years) when it should? Sorry, still not sure I'm following, as that wouldn't relate to schemas.

Marlan

Hazou

Yes, the problem is that the dataset Flux_Controle doesn't include a row that shows the difference in columns (i.e. between year and years).

When I open the dataset from the flow, I still see the previous name (year); when I check the schema, nothing has been updated.

I need to refresh the schema, either with Python or from a scenario (I tried building the dataset).

Thank you

Marlan

Hi @Hazou,

You say that when you open the dataset from the flow you don't see the updated schema. Which dataset are you referring to? I assume it's "test_1", not "Flux_Controle"?

If so, then the fact that Flux_Controle doesn't include a row that shows the difference in columns is a symptom of the underlying problem: "test_1" doesn't have an updated schema.

How do you create the "test_1" dataset? Do you use a Python recipe? If so, can you share that code?

Marlan

Hazou

Hello @Marlan 

I don't use a Python recipe to create test_1.

test_1 is created by importing the file test.csv from HDFS.

1. I put the file (test.csv) with the correct column names on my server: (id, year, sale)

2. I create a new dataset from this file (dataset = test_1)

3. I run my Python program; everything is OK because I have the correct names (id, year, sale)

4. I rename a column in my file by adding an s to year (id, years, sale)

5. I put this file (test.csv) on my server with the same name

6. I run my Python program; normally I should get an error, but I don't

When I open the dataset from the flow, I still see year.

Marlan

Hi @Hazou,

OK, I think I now have a better understanding of what you are doing...

Check out this solution - the problem posted appears to be similar to yours: https://community.dataiku.com/t5/Using-Dataiku-DSS/Refresh-read-schema-on-dataset-via-API/m-p/8730/h...
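In short, the idea there is to re-detect the dataset's settings through the public API so the read schema picks up the new column names. A rough sketch, assuming your DSS version exposes autodetect_settings and that "test_1" is the dataset to refresh:

import dataiku

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
dataset = project.get_dataset("test_1")

# Re-infer format settings and schema from the current file, then save
settings = dataset.autodetect_settings()
settings.save()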

Does this help?

Marlan
