How to import code from Git into a DSS project library

Can you import code from Git to be used within a Dataiku DSS project? Yes!

An important end goal of writing code is to be able to reuse it, whether within a DSS project, across projects within a DSS instance, or for projects external to DSS.

To this end, you can define code libraries within DSS that contain reusable code, and you can connect these libraries to remote Git repositories.

For example, if you have code that has been developed outside of DSS and is available in a Git repository (for example, a library created by another team), you can import this repository (or a part of it) in the project libraries, and use it in any code capability of DSS (such as recipes, notebooks, or web apps).

What’s next?

  • Reusing code is key to collaboration. Consult the reference documentation to learn more about reusing Python or R code. 
Comments
jefffriesen
Level 1

What kind of code from a git repository is available? Just Python and R? Or can you call a compiled jar file? (I'm writing code in Clojure, which compiles down to jars)

Mattsco
Dataiker

Hi, 

In code libraries, you can pull any Git repository.
However, the feature is designed for Python and R code, which you can then import in Python or R recipes.

Matt

 

jefffriesen
Level 1

Ok, I can call jar files from Python AFAIK. The biggest friction for me with Dataiku is not being able to use our own languages for code recipes. I think Python and R are probably the best choices if you had to pick 2, but limiting for us. 

GraalVM has been getting a lot of traction and could open up a lot more options for Dataiku users: https://www.graalvm.org

Once you have downloaded a Git repository, or part of one, into a project's library, is there a way to reference IPython notebooks that are in the library?

My first attempt did not yield any results. I suspect the problem is the path that the notebooks part of DSS looks at.

Clément_Stenac
Dataiker

Hi,

This is not possible at the moment. We are looking at enhancements in that regard. A first step in DSS 8.0 will be the ability to upload a .ipynb file directly from the DSS UI.

Cool.  I'm looking forward to updates in this area. 🙂

I'm starting to investigate whether I can do anything with Dataiku's VS Code plugin for DSS and VS Code's new updates for GitHub integration. Microsoft just shared some updates at the GitHub Satellite Virtual 2020 online conference today.

https://githubsatellite.com/schedule/#what-every-github-user-should-know-about-vs-code   

@Clément_Stenac ,

On a related topic, GitHub is currently also being used to share a variety of data sources, for example the COVID Policy Tracker and the Johns Hopkins University COVID-19 data. Using git to clone the data is really easy. However, if you clone it into the library, finding and using that data can be a real challenge. I eventually found the data under the config directory under the dss-home directory.

Has anyone worked out a simple way to set up a managed folder as the destination of a git clone operation? If so, how has that worked out in terms of maintenance?

Clément_Stenac
Dataiker

Hi,

This should be doable with a simple Python recipe that takes the folder as output and that clones the repository into the path of the folder.

Something like (NB: sample code, not actually tested)

import dataiku, subprocess

# Filesystem path of the output managed folder
path = dataiku.Folder("yourfolderid").get_path()

# Clone the repository into that path
subprocess.check_call(["git", "clone", "YOUR_URL", path])

 

For a public repository, this will work out of the box. For a private repository, you'd need to have either one key on the server (if your instance does not run UIF) or one key for each impersonated user (if your instance runs UIF).

@Clément_Stenac ,

Thanks for the suggestion.

This worked. I had to use the "coded" name of the folder ("qzTVGl7c"), not the human-readable name I provided when setting up the recipe.

I'm not clear on when I can use the human-readable name and when an internal name like "qzTVGl7c" must be used.

Now I have to figure out how to refresh the folder.

It looks like code such as

import dataiku, subprocess
path = dataiku.Folder("yourfolderid").get_path()
subprocess.check_call(["git", "-C", path, "pull"])

might act as a refresh.
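
The clone and pull snippets from this thread can be combined so that the same recipe works on the first run and on later refreshes. The following is only a minimal sketch and has not been tested inside DSS. The names git_sync_command and sync_managed_folder are hypothetical helpers, not part of the dataiku API. It assumes that git is installed on the server and that the managed folder is stored on the local filesystem, so that get_path() returns a usable directory.

```python
import os
import subprocess


def git_sync_command(path, url):
    """Return the git command that brings `path` up to date with `url`.

    Hypothetical helper: picks "pull" when `path` already holds a clone
    (a .git directory exists), and "clone" otherwise.
    """
    if os.path.isdir(os.path.join(path, ".git")):
        # Already a clone: pull the latest commits in place.
        return ["git", "-C", path, "pull"]
    # Not yet a clone: clone into the folder (git requires it to be empty).
    return ["git", "clone", url, path]


def sync_managed_folder(folder_id, url):
    """Clone or refresh a repository inside a DSS managed folder.

    Assumption: the folder is on the local filesystem, so get_path() works.
    """
    import dataiku  # imported here because it is only available inside DSS

    path = dataiku.Folder(folder_id).get_path()
    subprocess.check_call(git_sync_command(path, url))


# Inside a Python recipe with the managed folder as output, one might call:
# sync_managed_folder("yourfolderid", "YOUR_URL")
```

Keeping the command selection in its own function makes the clone-or-pull decision easy to check without running git or DSS.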

 

 

 

Publication date:
10-01-2020 08:04 PM
Last update:
‎01-10-2020 09:04 PM