Can I use my own python scripts?

Solved!
ASten1
Level 3

Hi,

I have some python scripts that I would like to use in my flow. I've seen that it is possible to load scripts in libraries and import them in python recipes, but I was wondering if there is a way to directly use them inside my flow, without having to write a python recipe.

 

Thank you!


6 Replies
tgb417

@ASten1 

There are lots of ways to include Python code in DSS Flows.  Can you say a little bit more about what you would like to achieve?

Have you discovered the DSS-specific Python Library feature? Have you discovered DSS plugins, code environments...?

--Tom

ASten1
Level 3
Author

Yes, I have read the documentation, but I haven't understood whether it is possible to do what I need. My goal is to create a flow in DSS where I can directly use my scripts, which are saved in a folder outside DSS.
It is clear to me that with libraries I can upload Python scripts and then import them in a Python recipe to use the modules defined in them. But what I want to achieve is something more direct, like using a script directly in my flow, so that I can work on my code without having to adapt the recipe in DSS every time I change something in my scripts.
I'll give an example so that maybe it's clearer. Suppose I want to create a pipeline, and one of the pipeline steps is to run a training script. I require that the training script takes a dataframe as input and produces a model and a file with training performances as output. Then, to make the flow work, I just need a script that satisfies those requirements, without any prescription about what it does internally. What I understood is that after uploading a script to the library, I still need to write a Python recipe that calls modules from that script, and I can't use the script directly as a recipe. That means adapting the training recipe every time I modify what the script does.
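To make this concrete, the kind of standalone script I have in mind looks roughly like this (just a sketch; the file names and the model choice are made up):

# train.py -- sketch of the kind of script I mean; file names and model are invented
import json
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def train(df: pd.DataFrame, target: str = "label"):
    """Take a dataframe, return a fitted model and a dict of training performances."""
    X, y = df.drop(columns=[target]), df[target]
    model = RandomForestClassifier(n_estimators=100).fit(X, y)
    metrics = {"train_accuracy": float(model.score(X, y))}
    return model, metrics


if __name__ == "__main__":
    df = pd.read_csv("training_data.csv")        # input: a dataframe
    model, metrics = train(df)
    with open("model.pkl", "wb") as f:           # output: a model...
        pickle.dump(model, f)
    with open("metrics.json", "w") as f:         # ...and a file with the training performances
        json.dump(metrics, f)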

Instead, I would like the flow to "watch" some folder (for example a git folder) where I can work on my code, so that when I modify something it is already reflected in my flow and I don't have to adapt the recipe or anything else.

Essentially I need the flow to be autonomous: once I have created it, I can work on my code without having to touch the recipes again. Can I create a step in my flow that simply runs a Python script saved in some folder?

I hope it is clearer now what I meant; if not, I'll try to explain again.

Thank you for your time!

tgb417

@ASten1 

I think I get what you want to do.  

Code somewhere not under the control of DSS, inserted as a step in a DSS flow. You want to make code changes and get the new behavior without having to change anything in the DSS flow.

If it is data you are sharing, its layout might change, which could make it hard for DSS to understand the schema coming off of your data. My sense is that you are going to need at least some code in a DSS recipe to capture that and ship it to a data store controlled by DSS.

In thinking further about this, I'm wondering if you are aware of the DSS REST API. It would allow you to do many things from outside DSS to the DSS environment, including a workflow. Your code could live separately but control things inside DSS; the API would represent the contract between your work and the DSS environment. However, I don't believe your code would appear as a step in the flow, although your code could create steps in a flow (there's a rough sketch after the notes below). Also, note:

  • Access to the REST API is not available in the free version of DSS. Do you have a full DSS license in your organization?
  • You need security permission to make changes via the DSS REST API.
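For example, an external script could drive DSS through the public Python client along these lines (a rough sketch from memory; the host URL, API key, project key, and scenario id are placeholders, so double-check the dataikuapi documentation for the exact calls):

# Sketch only: control DSS from outside using the dataikuapi client.
# The host, API key, project key and scenario id below are placeholders.
import dataikuapi

client = dataikuapi.DSSClient("https://my-dss-host:11200", "MY_API_KEY")
print(client.list_project_keys())      # sanity check: which projects can this key see?

project = client.get_project("MY_PROJECT")

# Trigger an existing scenario that builds the datasets / runs the recipes in the flow,
# so the logic you maintain outside DSS decides when and how the flow runs.
scenario = project.get_scenario("train_and_evaluate")
scenario.run_and_wait()                # blocks until the scenario finishes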

One of the things you did not mention is the value to you or your organization in doing this work outside DSS.  Why does working inside of the constraints that DSS provides cause you problems?

 

--Tom
Ignacio_Toledo

Hi @ASten1 . I think I understand you too, but I have a different approach than @tgb417, which might indicate that we have different understandings.

Whenever I introduce somebody who comes from the software development / engineering world to DSS, I like to draw a distinction between Python recipes/scripts and Python libraries or development work:

  • A Python recipe (or script) is code that I would write using exactly that tool in DSS: the Python "Code recipe". As a recipe, this code takes the input (a dataframe), perhaps some variables, and returns a result as a file or a new dataset. In the middle we can import libraries and use some pandas or dataiku methods, BUT I always avoid creating functions or classes in a recipe (or in a Jupyter notebook, for that matter), to keep a clear separation from the software development that might be needed.
  • So, if someone needs to create a library (because nothing on pip matches your needs), because you need to develop classes and/or define functions that might need further development, testing, or even the help of an expert software developer (which I'm not), we use the "Libraries" section of the DSS project. There you can connect the libraries to a git repository (see snapshot attached) that you can develop in an IDE outside Dataiku DSS; a small sketch of such a module follows just below.
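To illustrate the second point (just a sketch; the module path and function name are invented), a small module checked into the project's Libraries section could look like this:

# lib/python/myproject_lib/preprocessing.py -- hypothetical module under the project Libraries
import pandas as pd


def clean_and_enrich(df: pd.DataFrame) -> pd.DataFrame:
    """All the real logic lives here, developed and tested in a git repository outside DSS."""
    out = df.dropna(subset=["id"]).copy()
    out["name"] = out["name"].str.strip().str.lower()
    return out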

As long as the developed library doesn't require any change to the input dataset (a schema change, a new column) and doesn't change the output format, the Python "Code recipe" that imports and uses the library to transform the data won't need changes on a regular basis, and you can do all the development in a git repository that you then check out in the "Libraries" section of your project. This means we seldom touch the flow again, and whenever you update the libraries, the new algorithms are applied automatically.
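The code recipe itself then stays a thin piece of glue that you almost never touch; roughly like this (a sketch assuming the hypothetical module above and input/output datasets named "raw_data" and "clean_data"):

# Python code recipe: glue only, no business logic here.
import dataiku
from myproject_lib.preprocessing import clean_and_enrich  # the hypothetical library module

input_ds = dataiku.Dataset("raw_data")      # assumed input dataset name
output_ds = dataiku.Dataset("clean_data")   # assumed output dataset name

df = input_ds.get_dataframe()
result = clean_and_enrich(df)               # the evolving logic stays in the library
output_ds.write_with_schema(result)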

Does that make sense? Or is Tom's approach more related to your question?

-----

@tgb417 , I have an answer to your excellent question that might ignite some debate to be followed in a different place: "Why does working inside of the constraints that DSS provides cause you problems?"

Because, in my opinion, DSS is about the full data science life cycle, not about software development. As I said above, writing libraries, classes, and functions in a Jupyter notebook is not good practice in my opinion. Of course there are some gray areas, and you can cross the line when needed (as Pratchett wrote: "Look, that's why there's rules, understand? So that you think before you break 'em."), but in general this philosophy has been really helpful in separating the roles of data analysts and software developers, and it has led us to good collaborations (without stepping on each other's toes).

Cheers!

ASten1
Level 3
Author

Thank you for your replies; both gave me good insights to better understand the tool. @tgb417, what you wrote in the first sentence perfectly matches what I need. If I have understood correctly, what I wanted to do in the first place is not possible, but with the APIs and smart use of the libraries, I can achieve the same results. I've moved in that direction and so far everything is working well.

Thanks for the kind answers!

tgb417

@ASten1 

Excellent!!!

Please let us know more as you make progress.

--Tom
