Debugging with DSS
Hi team,
We are working within DSS with a code base that we have built ourselves. We import this code base with a git integration as a library.
Sometimes we need to debug the code we have built, as we would in an editor.
However, this is not possible from a Jupyter notebook. For now, we therefore replicate the Dataiku environment locally, with the same data files, so that we can debug and continue building our library, which consists of several modules.
Is it possible to debug within Dataiku? We have seen that it is possible to integrate Pycharm as an editor with DSS.
1. Would that allow us to access our repository and do the debugging on the fly?
2. If we change the modules, would we be able to push those changes from Pycharm directly to the repository?
Best Answer
-
Hi,
There is indeed no interactive debugger built into DSS. However, the various integration capabilities give you a lot of flexibility around that. For example, in your case, you would have at least these possibilities:
1. Do not use DSS-specific IDE integration, debug locally with local data files, push, pull
To my understanding, this is what you do currently:
- Most or all of your business logic lives as a set of Python modules
- The Python modules are hosted on a Git repository, and this Git repository is imported as a project library through the Git references capabilities
- Then, you develop your Python modules locally, using the debugger of your local IDE (e.g. Pycharm)
- This may require you to have some data files locally
- You push the changes to your Git repository
- You go in the DSS library editor and update from Git
- You can then run your recipes in DSS
2. Do not use DSS-specific IDE integration, debug locally with DSS-managed data, push, pull
A refinement on the first solution. Instead of having to manage data locally, leverage the fact that the dataiku Python package (i.e. the one behind dataiku.Dataset().get_dataframe() ) can work outside of Dataiku.
For setup instructions, please see: https://doc.dataiku.com/dss/latest/python-api/outside-usage.html
You would thus work on your libraries locally in Pycharm, but all data access would go through DSS, such that you don't need to worry about copying data, data access to non-files, credentials, ...
As before, you would then push your libraries to Git and pull them into Dataiku from the library editor.
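As a minimal sketch of option 2, assuming the dataiku package has been installed locally as described in the link above: `set_remote_dss`, `set_default_project_key` and `Dataset(...).get_dataframe()` are, to my knowledge, the calls documented for outside usage. The helper names, the connection arguments and the `loader` hook are hypothetical and only there for illustration.

```python
def load_dataframe(dataset_name, host, api_key, project_key):
    """Read a DSS dataset into a DataFrame from a local Python session."""
    # Imported lazily so this module can still be imported (and unit-tested)
    # on a machine where the dataiku package is not installed.
    import dataiku

    dataiku.set_remote_dss(host, api_key)          # point the package at your DSS instance
    dataiku.set_default_project_key(project_key)   # project that owns the dataset
    return dataiku.Dataset(dataset_name).get_dataframe()


def debug_library_function(func, dataset_name, loader=load_dataframe, **connection):
    """Fetch data through `loader`, then call the library function on it.

    Set a breakpoint inside `func` (your own library code) and run this
    from the Pycharm debugger.
    """
    data = loader(dataset_name, **connection)
    return func(data)


# Hypothetical usage; every value below is a placeholder:
# debug_library_function(
#     my_library.clean_orders, "orders",
#     host="https://dss.example.com", api_key="...", project_key="MYPROJECT",
# )
```

Passing a different `loader` (for example one that reads a small local sample) lets you run the same library function without a DSS connection.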
3. Use the Pycharm/DSS integration in order to directly edit recipes
While libraries can be pulled directly from Git, recipes cannot. Note that for large-scale projects, we recommend that your recipes contain only minimal glue code and that the bulk of your business logic lies in libraries, which can be managed in Git independently from DSS (see 1. and 2.)
What the Pycharm/DSS integration brings is direct editing of recipes in Pycharm. With that, you can "pull" a single recipe from DSS to Pycharm, edit it in Pycharm, run and debug it in Pycharm (leveraging the outside-of-DSS usage of the dataiku package previously highlighted), and save it. When you save it in Pycharm, it gets synced back to DSS.
-> This allows you to edit, run and debug both libraries and recipes directly in Pycharm
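To illustrate the "minimal glue code" recommendation, here is a rough sketch of a recipe body that only reads, delegates to a library function and writes. `Dataset(...).get_dataframe()` and `write_with_schema()` are the usual dataiku calls as far as I know; `run_recipe`, the `dataset_factory` hook and the dataset names are hypothetical.

```python
def run_recipe(transform, input_name, output_name, dataset_factory=None):
    """Glue code only: read the input, apply library logic, write the output."""
    if dataset_factory is None:
        # Lazy import: inside DSS (or with the outside-usage setup) this
        # resolves to the real dataiku.Dataset class.
        import dataiku
        dataset_factory = dataiku.Dataset

    df = dataset_factory(input_name).get_dataframe()
    result = transform(df)  # all business logic lives in the Git-managed library
    dataset_factory(output_name).write_with_schema(result)
    return result


# Hypothetical usage inside a recipe:
# from my_library.orders import clean_orders
# run_recipe(clean_orders, "orders_raw", "orders_clean")
```

Because the recipe holds no logic of its own, a breakpoint placed in the library function is hit whether the function is called from the recipe or from a standalone script.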
So, to answer your questions directly:
1. Would that allow us to access our repository and do the debugging on the fly?
Yes, but you don't actually need any specific integration for that. For libraries, you can already do it using Pycharm capabilities + Git libraries capabilities. Importing the dataiku Python package would allow you to access data remotely without having to copy it.
2. If we change the modules, would we be able to push those changes from Pycharm directly to the repository?
Pycharm can do that already; you don't actually need the Pycharm/DSS integration for that. What the integration adds is direct editing of recipes (in addition to libraries).
Answers
-
Hi Clement,
Thanks for your reply. I have an additional question though.
Let's consider step 3.
Local Pycharm has access to recipes and data, so we are able to debug recipes locally with access to production data. The recipe makes use of library functions, so let's assume that the bug is within the library. Ideally, we would like to be able to adjust the code in our branch, push again to Git, merge, update Dataiku and run the recipe.
1. When debugging a recipe locally with Pycharm, is the library used to run the code the Dataiku clone or my local copy?
2. If it is the Dataiku clone, can we change it to the local copy to avoid writing code in two places? If not, what approach do you recommend?
3. Updating the library within Dataiku is a manual process (AFAIK). Is there a way to trigger library updates in Dataiku for every merge or update we make on the develop branch on the remote?
4. After updating the library, do I need to restart the kernel in Pycharm in order to run the recipe with the latest version of my library?
Thank you in advance,
Alexandros
-
Hi,
When you run locally, you are really just running locally in Pycharm. There is no magic or anything specific that DSS does. You simply have loaded the dataiku package locally in order to access the data, but everything else happens in Pycharm just as it would if Dataiku was not involved.
Thus:
1. It would be your local copy
3. At the moment, there is no API to trigger refreshes of libraries in DSS; we have that in our backlog.
4. Here, I don't know Pycharm well enough to provide an answer. But it's really just regular Pycharm behavior; there is nothing Dataiku-specific there.
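On point 4, this is plain Python rather than anything specific to Pycharm or Dataiku: a Python session that is still running keeps the version of a module it imported first, while a fresh run of a script always imports the current files. A small sketch of re-importing an edited module in a live session, using only the standard library (`fresh_import` is a hypothetical helper name):

```python
import importlib
import sys


def fresh_import(module_name):
    """Import `module_name`, reloading it if this session already imported it."""
    # Make sure the import system notices files created or changed on disk.
    importlib.invalidate_caches()
    if module_name in sys.modules:
        # Re-executes the module's source in place and returns the module.
        return importlib.reload(sys.modules[module_name])
    return importlib.import_module(module_name)


# Hypothetical usage in a long-running console:
# my_library = fresh_import("my_library")
# print(my_library.__file__)  # shows which copy on disk is actually in use
```

Note that `importlib.reload` only re-executes the module you pass in: submodules are not reloaded, and names previously bound with `from module import name` keep pointing at the old objects. Restarting the session is the simplest way to be sure everything is current.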