
How to create and store a "Main Table" used in several projects

Solved!
Sv3n-Sk4
Level 3

Hello Everyone,

I am reaching out to get some advice.

The company where I work is migrating a lot of programs to Dataiku, step by step. Many of these programs run on outdated tools and/or languages and are executed manually (almost every week).

The idea now is to centralize and automate everything we can.

At the moment, we are focusing on decommissioning Coheris Liberty (formerly Harry Pilot).

I don't know who in this community will know it, but to explain quickly: it helps build SQL queries. You can "pre-code" a lot of variables and create small correspondence tables (I'm pretty sure there is a better English word for this, but I can't find it, sorry…) such as:

Code  Name
1     Green
2     Blue
3     Red
4     Yellow

 

The problem with switching to Dataiku is these tables: we have a lot of them, and some are more complicated, such as:

Code 1  Code 2  Code 3  Name
1       1       1       France
1       1       2       Germany
1       1       3       USA
1       2       1       Poland
1       2       3       Spain
2       1       1       Luxembourg

 

A lot of people (and code) are using these “rules” already established and will continue to do it with Dataiku.

The thing is, I am wondering how to transpose them and be able to reach and use them quickly and easily. It’s important we can continue to update them occasionally.

One of the problems is that we are not able to upload an xls dataset from our computers to work on a project, so we can't just keep a file in a server folder and update and manage it there.

I am wondering whether creating a "very big" Python dictionary, or some sort of "main table" stored on a server reachable by Dataiku, could be a good idea.

So that's why I am seeking help. What would you do in this situation?

Thanks a lot for reading.

PS: We are working on Dataiku version 9, but we will move to version 11 in 6 months.


Operating system used: Windows

1 Solution
Marlan

Hi @Sv3n-Sk4,

I'm not sure which option I'd choose if I were you. But I certainly would consider the option of putting an editable dataset in a central project and sharing that to other projects that need it. 

Here is a link to the editable dataset documentation: https://doc.dataiku.com/dss/latest/connecting/editable-datasets.html

Marlan


8 Replies
ZachM
Dataiker

Hi @Sv3n-Sk4,

It's possible to share a dataset between multiple projects. You could create the "main tables" in one project using any format that DSS supports (SQL, S3, etc), and then share them with any projects that need it.

For information on how to share datasets, see Shared objects.
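As a rough sketch of what reading such a shared table from Python could look like: the project key `REFDATA`, the dataset name `color_codes` and the column names `code` and `name` are all made up for illustration, and the dataset is assumed to be already shared with the project running the code.

```python
def rows_to_lookup(rows, key_col, value_col):
    """Turn a list of row dicts into a {code: name} dictionary."""
    return {row[key_col]: row[value_col] for row in rows}


def load_lookup(dataset_name, project_key, key_col, value_col):
    """Read a shared lookup dataset from another project as a dictionary."""
    import dataiku  # only importable inside DSS, so it is imported lazily

    # project_key points at the central project that owns the dataset
    df = dataiku.Dataset(dataset_name, project_key=project_key).get_dataframe()
    return rows_to_lookup(df.to_dict("records"), key_col, value_col)


# Hypothetical call, to be run inside DSS only:
# colors = load_lookup("color_codes", "REFDATA", "code", "name")
```

Colleagues who do not write Python can keep using the same shared dataset in visual recipes, so the table only has to be maintained in one place.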

 

Thanks,

Zach

Sv3n-Sk4
Level 3
Author

Hi @ZachM,

Thanks a lot for your answer 🙂

I did know that a dataset can be used in many projects, but is it, in your opinion, the best way to do what I want to do?

Is a shared library with Python dictionaries not a usable solution?

If I create a "main table", do you think it's better to create one table with a lot of columns (Code 1 / Name 1; Code 2 / Name 2; etc.)?

The goal is to be able to use a condition in one dataset, based on another dataset, to name individuals depending on their code.

For example:

If the value of a column = 3, then replace it with the name corresponding to code 3 from the right main dataset.

Thanks a lot again for your quick answer !! 
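To make the rule above concrete, here is a minimal plain-Python sketch. The column name `color` and the sample rows are invented for illustration; the lookup uses the colour table from the first post. Keeping unknown codes unchanged is an assumption, not something stated in the thread.

```python
# Lookup taken from the colour table in the first post
COLOR_NAMES = {1: "Green", 2: "Blue", 3: "Red", 4: "Yellow"}


def replace_codes(rows, column, lookup):
    """Return new rows where `column` holds the name instead of the code.

    Codes missing from the lookup are left unchanged (an assumption).
    """
    return [{**row, column: lookup.get(row[column], row[column])} for row in rows]


rows = [{"id": "a", "color": 3}, {"id": "b", "color": 9}]
result = replace_codes(rows, "color", COLOR_NAMES)
print(result)  # [{'id': 'a', 'color': 'Red'}, {'id': 'b', 'color': 9}]
```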

ZachM
Dataiker

Hi @Sv3n-Sk4,

For your use case, using a shared library would probably work better than a dataset since the tables would be easier to access that way.
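A minimal sketch of what such a library module might contain, using the three-code country table from the first post. The module name `lookup_tables.py` and the function name `country_name` are hypothetical; a tuple of the three codes is used as the dictionary key.

```python
# lookup_tables.py (hypothetical module name in a shared code library)

# Key: (Code 1, Code 2, Code 3), value: Name
COUNTRY_NAMES = {
    (1, 1, 1): "France",
    (1, 1, 2): "Germany",
    (1, 1, 3): "USA",
    (1, 2, 1): "Poland",
    (1, 2, 3): "Spain",
    (2, 1, 1): "Luxembourg",
}


def country_name(code1, code2, code3, default=None):
    """Return the name for a code triple, or `default` if it is unknown."""
    return COUNTRY_NAMES.get((code1, code2, code3), default)


print(country_name(1, 2, 3))  # Spain
```

Any recipe or notebook that can see the library could then import `country_name` instead of copying the table.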

 

As an alternative, you could use global variables, which can be accessed via Python.

You can set global variables by going to Administration > Settings > Variables.

 

You can access them in Python from any project like this:

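import json

import dataiku

# Fetch the variables visible from the current project as a dict
variables = dataiku.get_custom_variables()
# "code_table" is parsed from its JSON text into a Python dict
code_table = json.loads(variables["code_table"])
# Prints "blue"
print(code_table["1"])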

 

For more information about variables, see Variables.
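Since the screenshot of the variable definition is not reproduced here, the sketch below shows one shape that would be consistent with the snippet above. The JSON content is an assumption, and the plain dictionary `fake_variables` only stands in for the return value of `dataiku.get_custom_variables()`, so it runs outside DSS.

```python
import json


def parse_table(variables, name):
    """Return the lookup table stored under `name` as a Python dict."""
    raw = variables[name]
    # Parse JSON text; pass through a value that is already a dict
    return json.loads(raw) if isinstance(raw, str) else raw


# Assumed content, chosen only to match the 'Prints "blue"' comment
fake_variables = {"code_table": '{"1": "blue", "2": "green"}'}

code_table = parse_table(fake_variables, "code_table")
print(code_table["1"])  # blue
```

Note that JSON object keys are always strings, so a numeric code has to be looked up as `"1"`, not `1`.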

Thanks,

Zach

Sv3n-Sk4
Level 3
Author

Thanks again for your time @ZachM.

I will explore the Python dictionary route. However, I am not sure it will be very readable for my colleagues: they are not fluent in Python, the dictionary would be a big one, and it won't be easy to update if needed.

Creating variables will have the same issue, as it will be hard for everyone to follow.

I will try to find a way that is usable and easy for everyone.

I am starting to learn about APIs. I am not sure whether I can create one where I would store all my prepared code (or my existing tables) and fetch it when needed (not sure if I can, not sure I know how, and not sure if my company will allow it).

What I can see now is that there doesn't seem to be a perfect solution for my problem. I will need to find the best compatible one 😉

Thanks again!

CoreyS
Dataiker Alumni

Hi @Sv3n-Sk4 the Product Ideas board is here to let you share and exchange your ideas on how to improve Dataiku so please feel free to utilize it if you think there is an opportunity! Here are some resources to help get you started:

How to suggest Dataiku ideas 
Participating on the Product Ideas board 
Suggest an idea

I hope this helps!

 

Sv3n-Sk4
Level 3
Author

Thanks @CoreyS !

Will have a look 🙂


Sv3n-Sk4
Level 3
Author

Thanks @Marlan !

I think it's gonna take some time to translate the solution in a big editable dataset but it seems to be the easiest and most understandable way to do it for the whole team.

🙂