Activate a Spark pipeline using the Dataiku Python API
Hello,
Is it possible to activate a Spark pipeline using the Dataiku Python API? I am interested in automating the process of running Spark pipelines within my Dataiku projects and would like to know if the Python API provides functionality for initiating and managing these pipelines. If so, could you please provide guidance or examples on how to implement this?
Answers
Turribeach (Neuron; Dataiku DSS Core Designer, Dataiku DSS Adv Designer) Posts: 2,139
Conceptually, what you are asking does not quite make sense. The Dataiku Python API is there to interact with Dataiku, not with Spark. If you want to automate your existing Spark pipelines from Dataiku, you should be looking at a Spark Python API, which is what PySpark is. You can use a Python or PySpark recipe to invoke Spark pipelines from Dataiku (see the sketch below), but there is limited value in this unless you fully integrate Spark with Dataiku, which brings us to the next point.
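For illustration, here is a minimal PySpark recipe sketch using the dataiku.spark helpers from the Dataiku documentation. The dataset names "my_input" and "my_output" and the "amount" column are hypothetical placeholders for your own project's objects.

```python
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Spark context provided by the PySpark recipe runtime
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read the recipe's input dataset as a Spark DataFrame
# ("my_input" / "my_output" are hypothetical dataset names)
input_dataset = dataiku.Dataset("my_input")
df = dkuspark.get_dataframe(sqlContext, input_dataset)

# Whatever Spark transformation you want to automate
# (the "amount" column is assumed for the example)
result = df.filter(df["amount"] > 0)

# Write the result back to the recipe's output dataset
output_dataset = dataiku.Dataset("my_output")
dkuspark.write_with_schema(output_dataset, result)
```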
Dataiku also supports offloading processing to Spark. I would highly recommend you read the documentation in full, as there are many different ways to integrate Spark with Dataiku. In general terms, in this pattern the Spark code comes from Dataiku (either from a visual or a code recipe) and executes on the Spark engine/cluster. In that setup the Dataiku Python API can run a Dataiku scenario, which then offloads the processing work of the Spark recipes to Spark. But this is not what you asked.
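As a hedged sketch of that second pattern, here is how you could trigger such a scenario with the dataikuapi client. The host URL, API key, project key, and scenario id are all placeholders you would replace with your own values.

```python
import dataikuapi

# Connect to the DSS instance (URL and API key are placeholders)
client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")

# Get the scenario that runs the Spark recipes
# ("MYPROJECT" and "run_spark_pipeline" are hypothetical ids)
project = client.get_project("MYPROJECT")
scenario = project.get_scenario("run_spark_pipeline")

# Trigger the scenario and block until it finishes;
# the Spark recipes inside it execute on the Spark engine/cluster
run = scenario.run_and_wait()
print("Outcome:", run.outcome)  # SUCCESS, WARNING, FAILED or ABORTED
```

Note that from code running inside DSS itself (a recipe or notebook) you can obtain the same client with dataiku.api_client() instead of hardcoding a URL and API key.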