How to modify values of a dataset using an overriding file

AlexandreLH
Level 2
How to modify values of a dataset using an overriding file

Hi,

I am currently designing a worklow using dataiku that would enable to modify values of a data set using information from another file.

Let's say that my dataset has 50 columns, and some information inside of it need to be changed to another value. For instance : 

-I need to modify every values of column nยฐ 30 with "Blue" for the line that has Column 1 set to "Dog"

-I need to modify the value of column nยฐ 17 with "Europe" if the line has Column 1 set to "Cat" and Column 2 set to "Man"

Therefore, I thought about designing a file with a specific template of 8 columns : 

-5 key columns : in which i need to fill in information to set a key on which to select the lign to update

-column tag : in which i shall fill in the name of the column that need to be changed

-New value tag : in which I shall fill in the new value that need to be put in the dataset.

-Comment : not used for the update execution, just for tracking down updates reason

The game consists then to filter using the 5 columns (or less, to make the file flexible)

I meet difficulties however figuring out what recipe to use and what exactly to do to make that search, knowing each line of the file would generate a different search.

Does creating variables is the right approach?, Any other suggestions?

 

Thank you very much for your help!

Alexandre

0 Kudos
4 Replies
emate
Level 5

Hi @AlexandreLH 

I'm not sure if I understand, but let me ask some questions that might help me understand it.

Let's take the first example, you want to fill /change value of column nr 30 with value = Blue if in Column1 = Dog, are these 2 columns exist in the same dataset with all 50 columns? If yes you don't need another file, you just need "formula recipe" for example.

If this Column1 = Dog, which you want to use to change the value of Col50 is in another file, I guess you would need to use "join" recipe (if there is a key between these 2 files) and then, again use for example 'formula recipe' and change rows where this newly joined Col1 = Dog?

 

Mateusz 

0 Kudos
AlexandreLH
Level 2
Author

Hi @emate

Thank you for your quick feedback.

Indeed, the five first columns name in the override file attached to my previous post are included among the columns of the dataset I want to modify using this override file.

We set these five columns that shall always be the same in the template, among the 50 of the data set, to complete (all of them or partly) in order to filter on the lines of the dataset we shall modify a value of the specified colum. 

The difficulty as I picture it right now is to program how dataiku shall create one filter per ligne of the override file and apply a modification on a column that could be any of the 50.

I updated the attached file to show an example of dataset update using the previous examples.

Alexandre

0 Kudos
emate
Level 5

Can you try Prepare recipe -> Formula and then try something like:

if(Name of Column 1== 'Dog', "Blue", Name of Column 30) or 
if(Name of Column 1== 'Dog'), "Blue", Name of Column 30)) or

if(Name of the Column 1=="Dog", replace(Column 30, "Red", "Blue"),Column 30)

And then repeat this formula step for each of the conditions, I dont remember exact syntax, but I am pretty sure that prepare recipe should work

Mateusz

0 Kudos
AlexandreLH
Level 2
Author

The Prepare recipe does work. However, the modifications to apply to my dataset are not always the same (not always 2 lines in the file, not always on column 1, could but any of columns 1 to 5 and any combination between 1 to 5). 

At time t, I receive a dataset, but I know that there are some values that are wrong and I want to correct them. Since there could be 100 values to modify, I wanted to make the user go through a file to override the change in a more organized way but still flexible. Thus, the user fill the values to change in a csv file he would upload and then expect dataiku to operate the modifications using a dynamic equivalent of the Prepare recipe.

But at time t+1, i get new information that tells me to modify other values of my dataset. I would therefore upload a new file (same 7 columns format) in dataiku and the program would then apply the new changes to the dataset. 

In each file, the number of steps depends on the number of line in the file. If I have a 20 lines file, i would expect to make changes 20 times on the line relevent (based on the 5 column key). If the next file has 100 lines, I would apply 100 changes on all concerned lines of the dataset

In the Formula/Prepare Recipe, it would not be :

if(Name of Column 1== 'Dog', "Blue", Name of Column 30)

It would rather be more like variables  (sorry for the upcoming syntax) :

for line number i of the override file :

if ((concatenate(Dataset value in Column 1;Dataset value in Column 2;Dataset value in Column 3; Dataset value in Column 4;Dataset value in Column5)) = concatenate(Override file value line i column 1;Override file value line i column 2; Override file value line i column 3;Override file value line i column 4;Override file value line i column 5)

then Dataset value in Column named {Override file value line i Column 6} is set to {Override file value line i Column 7}.

I'm sure the syntax is completely wrong, but I would expect something like this, since I don't how many updates there would be.

I'm sorry if I'm not making it simple to understand. The main goal is to enable to update a dataset using another file (Override Template). This way the user doesn't have to write any formula since I would hope dataiku program would recognize the line on which to apply every updates listed in the override file. 

0 Kudos