How to import code from Git into a DSS project library

Dataiku
Dataiku Administrator, Dataiker, Alpha Tester Posts: 88 Administrator

Can you import code from Git to be used within a Dataiku DSS project? Yes!

An important end goal of writing code is to be able to reuse it, whether within a DSS project, across projects within a DSS instance, or for projects external to DSS.

To this end, you can define code libraries within DSS that contain reusable code, and you can connect these libraries to remote Git repositories.

For example, if you have code that was developed outside of DSS and is available in a Git repository (say, a library created by another team), you can import this repository, or a part of it, into the project library and use it in any code capability of DSS (such as recipes, notebooks, or web apps).
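As a rough illustration of what "use it in any code capability" means in practice, here is a minimal sketch. It assumes the imported code ends up in a folder that is on the Python path, which is what makes a plain `import` work. The module name `team_utils` and the function `clean_name` are hypothetical, and the temporary directory only stands in for the library folder so that the sketch runs outside DSS.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for the project library's Python folder. Inside DSS the library
# is made importable for you; this setup only makes the sketch runnable
# outside DSS.
lib_dir = Path(tempfile.mkdtemp())

# Hypothetical module that would come from the imported Git repository.
(lib_dir / "team_utils.py").write_text(
    "def clean_name(value):\n"
    "    return value.strip().lower()\n"
)

# Put the stand-in library folder on the Python path.
sys.path.insert(0, str(lib_dir))
importlib.invalidate_caches()

# This import line is the part a recipe or notebook would actually contain.
from team_utils import clean_name

print(clean_name("  Alice "))
```

In a real project only the `from team_utils import clean_name` line and the call would appear in the recipe; everything above it is scaffolding for the example.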

What’s next?

  • Reusing code is key to collaboration. Consult the reference documentation to learn more about reusing Python or R code.

Comments

  • jefffriesen
    jefffriesen Registered Posts: 2 ✭✭✭✭

    What kind of code from a git repository is available? Just Python and R? Or can you call a compiled jar file? (I'm writing code in Clojure, which compiles down to jars)

  • Mattsco
    Mattsco Dataiker, Registered Posts: 125 Dataiker

    Hi,

    In code libraries, you can pull any Git repository.
    However, the feature is designed for Python and R code, which you can then import in Python or R recipes.

    Matt

  • jefffriesen
    jefffriesen Registered Posts: 2 ✭✭✭✭

    Ok, I can call jar files from Python AFAIK. The biggest friction for me with Dataiku is not being able to use our own languages for code recipes. I think Python and R are probably the best choices if you had to pick 2, but limiting for us.

    GraalVM has been getting a lot of traction and could open up a lot more options for Dataiku users: https://www.graalvm.org

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    Once you have downloaded a Git repository, or part of one, into a project's library, is there a way to reference IPython notebooks that are in the library?

    My first attempt did not yield any results. I suspect a problem with the path that the notebooks part of DSS looks at.

  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753 Dataiker

    Hi,

    This is not possible at the moment. We are looking at enhancements in that regard. A first step in DSS 8.0 will be the ability to upload a .ipynb file directly from the DSS UI.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    Cool. I'm looking forward to updates in this area.

    I'm starting to investigate whether I can do anything with Dataiku's VS Code plugin for DSS and the new VS Code updates for GitHub integration. MS just shared some updates at GitHub's Satellite Virtual 2020 online conference today.

    https://githubsatellite.com/schedule/#what-every-github-user-should-know-about-vs-code

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @Clément_Stenac,

    On a related topic, GitHub is currently also being used to share a variety of data sources, for example the COVID Policy Tracker and the Johns Hopkins University COVID-19 data. Using git to clone the data is really easy. However, if you clone it into the library, finding and using that data can again be a real challenge. I eventually found the data under the config directory under the dss-home directory.

    Has anyone worked out a simple way to set up a managed folder as the destination of a git clone operation? If so, how has that worked out in terms of maintenance?

  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753 Dataiker
    edited July 17

    Hi,

    This should be doable with a simple Python recipe that takes the folder as output and that clones the repository into the path of the folder.

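    Something like this (NB: sample code, not actually tested):

    import subprocess

    import dataiku

    # Filesystem path of the output managed folder
    path = dataiku.Folder("yourfolderid").get_path()

    # Clone the repository into the folder; raises CalledProcessError if git fails
    subprocess.check_call(["git", "clone", "YOUR_URL", path])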

    For a public repository, this will work out of the box. For a private repository, you'd need either one key on the server (if your instance does not run the User Isolation Framework, UIF) or one key for each impersonated user (if your instance runs UIF).
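For the private-repository case, one possible way to point git at a specific key is the `GIT_SSH_COMMAND` environment variable. This is only a sketch under assumptions: it presumes the repository is reached over SSH, the helper name `git_env_with_key` and the key path are hypothetical, and nothing here is specific to DSS or UIF.

```python
import os
import shlex


def git_env_with_key(key_path, base_env=None):
    """Return a copy of the environment that makes git use one SSH key."""
    env = dict(os.environ if base_env is None else base_env)
    # IdentitiesOnly=yes tells ssh to offer only the key given with -i.
    env["GIT_SSH_COMMAND"] = "ssh -i {} -o IdentitiesOnly=yes".format(
        shlex.quote(key_path)
    )
    return env


# Hypothetical key location; replace with wherever the key actually lives.
env = git_env_with_key("/home/dss/.ssh/id_deploy", base_env={})
print(env["GIT_SSH_COMMAND"])
```

The returned dictionary can be passed as `env=` to `subprocess.check_call` when running the clone, so the key choice applies to that one command rather than to the whole server.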

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron
    edited July 17

    @Clément_Stenac,

    Thanks for the suggestion.

    This ended up working. I had to use the internal ID of the folder, "qzTVGl7c", rather than the human-readable name I provided when setting up the recipe.

    I'm not clear on when I can use the human-readable name and when an internal ID like "qzTVGl7c" must be used.

    Now I have to figure out how to refresh the folder.

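    It looks like code such as

    import subprocess

    import dataiku

    # Filesystem path of the managed folder that holds the earlier clone
    path = dataiku.Folder("yourfolderid").get_path()

    # Run "git pull" inside that folder to fetch the latest changes
    subprocess.check_call(["git", "-C", path, "pull"])

    might act as a refresh.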
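The clone and pull snippets from this thread could be combined into a single recipe step that clones on the first run and pulls afterwards. The sketch below is untested against DSS and rests on assumptions: `sync_command` and `sync_repo` are hypothetical helper names, the presence of a `.git` directory is used as the signal that a clone already exists, and the path would come from `dataiku.Folder(...).get_path()` as in the comments above.

```python
import os
import subprocess
import tempfile


def sync_command(path, url):
    """Return the git command that brings `path` up to date with `url`."""
    if os.path.isdir(os.path.join(path, ".git")):
        # A clone already exists here: only fetch and fast-forward.
        return ["git", "-C", path, "pull", "--ff-only"]
    # No clone yet: create one (git requires the target to be empty).
    return ["git", "clone", url, path]


def sync_repo(path, url):
    """Clone on the first run, pull on later runs."""
    subprocess.check_call(sync_command(path, url))


# Demonstration on a temporary directory; no git command is executed here.
demo_dir = tempfile.mkdtemp()
first_run = sync_command(demo_dir, "YOUR_URL")
os.mkdir(os.path.join(demo_dir, ".git"))
later_run = sync_command(demo_dir, "YOUR_URL")
print(first_run[:2], later_run[-2:])
```

Using `--ff-only` keeps the refresh from creating merge commits if the folder contents were changed locally; in that case the pull fails and the recipe surfaces the error instead.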
