How to create and store a "Main Table" used in several projects
Hello Everyone,
I am reaching out to get some advice.
The company where I work is moving a lot of programs to Dataiku, step by step. Many of these programs run on outdated tools and/or languages and are executed manually (almost every week).
The idea now is to centralize and automate everything we can.
At the moment, we are focusing on decommissioning Coheris Liberty (formerly Harry Pilot).
I don't know who in this community will know it, but to explain quickly, it helps to build SQL queries. You can “pre-code” a lot of variables and create small correspondence tables (lookup tables), such as:
Code | Name |
1 | Green |
2 | Blue |
3 | Red |
4 | Yellow |
The problem with switching to Dataiku is these tables: we have a lot of them, and some are more complicated, such as:
Code 1 | Code 2 | Code 3 | Name |
1 | 1 | 1 | France |
1 | 1 | 2 | Germany |
1 | 1 | 3 | USA |
1 | 2 | 1 | Poland |
1 | 2 | 3 | Spain |
2 | 1 | 1 | Luxembourg |
A lot of people (and code) already rely on these established “rules” and will continue to do so with Dataiku.
I am wondering how to transpose them so that we can reach and use them quickly and easily. It's important that we can continue to update them occasionally.
One of the problems is that we are not able to upload an xls dataset from our computers into a project, so we can't simply keep a file in a server folder and update it there.
I am wondering whether creating a "very big" Python dictionary, or some sort of “main table” stored on a server reachable by Dataiku, could be a good idea.
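As a rough illustration of the dictionary idea, the two example tables above could be written as shown below. This is only a sketch: the names COLOR_NAMES, COUNTRY_NAMES and country_name are hypothetical, and the module is assumed to live somewhere the projects can import it from.

```python
# Hypothetical lookup tables mirroring the two example tables above.
COLOR_NAMES = {1: "Green", 2: "Blue", 3: "Red", 4: "Yellow"}

# Multi-code tables can use a tuple (Code 1, Code 2, Code 3) as the key.
COUNTRY_NAMES = {
    (1, 1, 1): "France",
    (1, 1, 2): "Germany",
    (1, 1, 3): "USA",
    (1, 2, 1): "Poland",
    (1, 2, 3): "Spain",
    (2, 1, 1): "Luxembourg",
}


def country_name(code1, code2, code3, default="Unknown"):
    """Return the name for a code triple, or `default` if it is not listed."""
    return COUNTRY_NAMES.get((code1, code2, code3), default)


print(COLOR_NAMES[2])         # Blue
print(country_name(1, 2, 3))  # Spain
print(country_name(9, 9, 9))  # Unknown
```

Using a tuple as the key keeps one entry per row of the original table, so updating a rule means editing a single line.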
So that's why I am asking for help. What would you do in this situation?
Thanks a lot for reading.
PS: We are working on Dataiku version 9, but we will move to version 11 in 6 months.
Operating system used: Windows
Best Answer
-
Marlan Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Dataiku Frontrunner Awards 2021 Participant, Neuron 2023 Posts: 319 Neuron
Hi @Sv3n-Sk4,
I'm not sure which option I'd choose if I were you, but I would certainly consider putting an editable dataset in a central project and sharing it with the other projects that need it.
Here is a link to the editable dataset documentation: https://doc.dataiku.com/dss/latest/connecting/editable-datasets.html
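A minimal sketch of how a notebook or recipe in another project might read such a shared table with the Dataiku Python API. The project key REFERENCE_TABLES, the dataset name country_codes, and the column names code_1, code_2, code_3 and name are all hypothetical; in a recipe, the shared dataset would also need to be declared as an input.

```python
def load_records():
    """Read the shared lookup dataset as a list of row dicts (runs only inside DSS)."""
    import dataiku  # available only in a DSS code environment

    # Hypothetical dataset name and project key.
    dataset = dataiku.Dataset("country_codes", project_key="REFERENCE_TABLES")
    return dataset.get_dataframe().to_dict("records")


def build_lookup(records):
    """Map (code_1, code_2, code_3) -> name from a list of row dicts."""
    return {
        (int(r["code_1"]), int(r["code_2"]), int(r["code_3"])): r["name"]
        for r in records
    }


# Stand-in rows so the helper can be tried outside DSS; inside DSS,
# build_lookup(load_records()) would be used instead.
sample = [
    {"code_1": 1, "code_2": 1, "code_3": 1, "name": "France"},
    {"code_1": 1, "code_2": 2, "code_3": 3, "name": "Spain"},
]
lookup = build_lookup(sample)
print(lookup[(1, 2, 3)])  # Spain
```

With this split, colleagues only ever edit the dataset in the central project, while the code that consumes it stays unchanged.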
Marlan
Answers
-
Hi @Sv3n-Sk4,
It's possible to share a dataset between multiple projects. You could create the "main tables" in one project using any format that DSS supports (SQL, S3, etc.), and then share them with any projects that need them.
For information on how to share datasets, see Shared objects.
Thanks,
Zach
-
Sv3n-Sk4 Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 32 ✭✭✭✭
Hi @ZachM,
Thanks a lot for your answer. I did know that a dataset can be used in many projects, but is it, in your opinion, the best way to do what I want to do? Would a shared library with Python dictionaries not be usable?
If I create a "main table", do you think it's better to create one table with a lot of columns (Code 1 / Name 1; Code 2 / Name 2; etc.)?
The goal is to be able to apply a condition in one dataset, based on another dataset, to name individuals according to their code.
For example:
If the value in a column is 3, replace it with the name corresponding to code 3 in the relevant main dataset.
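The rule described here amounts to a lookup (or a left join) on the code column. A minimal sketch in plain Python, assuming the main table has already been loaded as a code-to-name mapping; the column names color_code and color_name and the helper add_names are made up for illustration. In a visual flow, a Join recipe on the code column would be the closest equivalent.

```python
def add_names(rows, lookup, code_column, name_column, default=None):
    """Return copies of `rows` with `name_column` filled in from `lookup`."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row[name_column] = lookup.get(row[code_column], default)
        result.append(new_row)
    return result


# Main table: code -> name (same values as the first example table).
color_lookup = {1: "Green", 2: "Blue", 3: "Red", 4: "Yellow"}

# Another dataset whose rows carry only the code.
individuals = [
    {"id": "A", "color_code": 3},
    {"id": "B", "color_code": 1},
    {"id": "C", "color_code": 7},  # code missing from the main table
]

named = add_names(individuals, color_lookup, "color_code", "color_name")
for row in named:
    print(row["id"], row["color_name"])
```

Codes that are missing from the main table get the default value (None here), which makes gaps in the rules easy to spot.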
Thanks a lot again for your quick answer!
-
Hi @Sv3n-Sk4,
For your use case, using a shared library would probably work better than a dataset since the tables would be easier to access that way.
As an alternative, you could use global variables, which can be accessed via Python.
You can set global variables by going to Administration > Settings > Variables:
You can access them in Python from any project like this:
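```python
import json

import dataiku

# Global variables are exposed to Python as a dict of strings
variables = dataiku.get_custom_variables()

# The "code_table" variable holds a JSON object, so parse it first
code_table = json.loads(variables["code_table"])

# Prints "blue"
print(code_table["1"])
```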
For more information about variables, see Variables.
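One caveat with this approach is that JSON object keys are always strings, so a multi-code table needs its codes combined into a single string key. A sketch using only the standard json module: the variable name country_table, the "|" separator and the helper country_name are assumptions, and the raw_value string stands in for what `dataiku.get_custom_variables()["country_table"]` would hold.

```python
import json

# Stand-in for the raw string of a hypothetical "country_table" global variable.
raw_value = '{"1|1|1": "France", "1|1|2": "Germany", "1|2|3": "Spain"}'

country_table = json.loads(raw_value)


def country_name(code1, code2, code3, default="Unknown"):
    """Look up a code triple using the assumed "code1|code2|code3" key format."""
    key = f"{code1}|{code2}|{code3}"
    return country_table.get(key, default)


print(country_name(1, 1, 2))  # Germany
print(country_name(2, 2, 2))  # Unknown
```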
Thanks,
Zach
-
Sv3n-Sk4 Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 32 ✭✭✭✭
Thanks again for your time @ZachM.
I will explore the Python dictionary route. However, I am not sure it will be very readable for my colleagues, as they are not fluent in Python, the dictionary would be a big one, and it won't be easy to update if needed.
Creating variables will have the same issue, as it will be hard for everyone to follow.
I will try to find a way that is usable and easy for everyone.
I am starting to learn about APIs. I am not sure whether I can create one where I would store all my prepared code (or my existing tables) and fetch it when needed (not sure if I can, not sure I know how, and not sure my company will allow it).
What I can see now is that there doesn't seem to be a perfect solution for my problem. I will need to find the best compatible one.
Thanks again!
-
CoreyS Dataiker Alumni, Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Registered Posts: 1,150 ✭✭✭✭✭✭✭✭✭
Hi @Sv3n-Sk4, the Product Ideas board is here to let you share and exchange your ideas on how to improve Dataiku, so please feel free to use it if you think there is an opportunity! Here are some resources to help get you started: Suggest an idea
I hope this helps!
-
Sv3n-Sk4 Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 32 ✭✭✭✭
Thanks @Marlan!
I think it's going to take some time to translate the solution into one big editable dataset, but it seems to be the easiest and most understandable way to do it for the whole team.