Flow Zone Reuse? Can one flow zone be reused for multiple datasets?

tgb417

I've been processing data about the files on a number of file systems across various disks. The flow zone I've created looks like this.

Image showing: a flow zone with 5 datasets, 2 Shell recipes, 2 visual Preparation recipes, and 1 Join recipe

 

There are two simple shell scripts that gather data about the same file volume, two Preparation recipes that clean up that data, and one Join recipe that brings both sets of data about the one file system together.

For each file system I'm processing, I create a new flow zone with the same steps. I have to repeat this for a number of volumes attached to my host, and these volumes will change over time. The only real difference between zones is the path where the files come from.

Part of the reason to break the flow zone into two paths at the beginning is that I want to be able to control when each of these recipes runs: the MD5 shell scripts take about 3/4 of an hour for every 100 GB I need to process, while pulling the stats about the files takes about a tenth of that time.

My question for the community: has anyone worked out a way to reuse a single flow zone for multiple datasets? This would save a bunch of time copying flow zones, save time managing multiple flow zones, and increase reliability because all datasets would get the same processing.

Thoughts?  


Operating system used: Mac OS 10.15.7

--Tom
13 Replies
KeijiY
Dataiker

Hello Tom,

Thank you so much for the post on Community.

DSS has the "Application-as-recipe" feature: you define a flow in a project and convert it into an Application-as-recipe, which packages that flow into a recipe you can reuse on other datasets. This feature might be useful for your use case. Please see this DSS document https://doc.dataiku.com/dss/latest/applications/application-as-recipe.html for the details of the feature.

I hope this helps.

Sincerely,
Keiji, Dataiku Technical Support

tgb417
Author

@KeijiY 

Application-as-recipe looks promising. However, I feel like I need a few more details about how to use this feature set. Are there any training materials or more detailed examples on the use of this feature?

--Tom

Oh, I think I may have found some additional information that might be useful to me to get started. https://knowledge.dataiku.com/latest/courses/o16n/dataiku-applications/create-app-as-recipe.html

--Tom
KeijiY
Dataiker

@tgb417 Yes, you can refer to the knowledge base article you mentioned for the details of the feature. Please let us know if you have any further questions regarding the feature.

Sincerely,
Keiji, Dataiku Technical Support

tgb417
Author

@KeijiY 

How do code updates work in the Application-as-recipe scenario?

I create the application-as-recipe.  I put it into production. Now I find a bug in the application or a case that did not work as expected.  How do I update that application-as-recipe?

--Tom

KeijiY
Dataiker

Hello @tgb417,

When you run an Application-as-recipe, the latest version of the project's flow is automatically copied and used [DSS doc]. So it is fine to simply update the flow of the project behind the Application-as-recipe.

tgb417
Author

@KeijiY ,

I've worked my way through the knowledge base article on Application-as-recipe, and I have been able to make it work as written. However, it is a very specific multi-step recipe for turning a section of the Haiku T-shirt example into a reusable component.

However, I'm having a hard time generalizing this specific set of instructions into a better understanding of Dataiku Applications, and more specifically Application-as-recipe, so that I can create my own to do the kind of thing I'd like to do. Is there a more general set of instructional material on this subject that explains the purpose of each of the listed steps, with an eye toward being able to create other such recipes?

cc: @CoreyS 

--Tom
KeijiY
Dataiker

Hello @tgb417,

Thank you so much for the feedback. I have shared the feedback with our Educational Services team.

We really appreciate your feedback.

Sincerely,
Keiji, Dataiku Technical Support

Marlan

Hi Tom (@tgb417),

I saw your related post over in Product Ideas:

https://community.dataiku.com/t5/Product-Ideas/Let-users-create-and-publish-custom-prepare-recipe-pr...

I'm sure you've thought of this but could you package the process into a Python function and then call that where needed? I'd think you might be able to do this by calling the API to create the recipes needed (i.e., replicating the recipes you have already defined). Or maybe one could do everything directly in Python?

I was excited when applications-as-recipes were first announced, but then learned that the feature doesn't support writing output to a SQL table as an in-database operation. I get that this is hard and maybe not doable, but it certainly limits the use, at least for us, as almost all of our processing is done in-database on sometimes pretty large SQL tables. I did find as well that the learning curve was a bit steep.

I ended up being able to accomplish the same goal (package common logic for sharing within the company) using plugins (i.e., custom recipes).

We have also developed a plugin recipe that runs the script from another recipe. It replaces the inputs and outputs of the other recipe with those of the current recipe and then runs the script. It is limited to SQL Script recipes, but it has been a real boon for situations where we want to run the same preparation process in both a training flow and a scoring flow.
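
Very roughly, the core of the idea looks something like this (a simplified sketch, not our actual plugin code, and the recipe and dataset names here are made up): read the other recipe's SQL script through the public API, then execute it in-database against the current recipe's output.

import dataiku
from dataiku import SQLExecutor2

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())

# Read the SQL script (payload) of the recipe whose logic we want to reuse
source_recipe = project.get_recipe('compute_prepared_train')  # made-up recipe name
sql_script = source_recipe.get_settings().get_payload()

# The real plugin also swaps the input/output table names in the script for
# those of the current recipe before running it in-database
output_dataset = dataiku.Dataset('prepared_scoring')  # made-up output dataset
SQLExecutor2.exec_recipe_fragment(output_dataset, sql_script)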

I realize none of that helps you. Just commenting on the theme of wanting to share processes across different parts of a flow.

Marlan 

 

tgb417
Author

@Marlan 

Thanks for the time to share your thoughts.

My use case is about moving summary data from just about 1 million files into a PostgreSQL database for evaluation. Unfortunately, it's not SQL on both the input and output of the recipe.

I certainly started with Python code and was able to get some of these things to work. However, I ran into challenges with scaling the Python approach on my limited hardware. In this case I'm using DSS's Shell recipes, because the Dataiku Python library seems to be buggy for me when working with more than 100,000 files in a file system directory structure. And the performance of the Dataiku Python library has been 1-2 orders of magnitude slower in dealing with files than the shell-recipe-based approach.

Regarding plugins, I'm fascinated by this concept, and I enjoy being able to leverage the many plugins that are out there. However, I have not fully worked out how to build my own plugins from scratch. There are several projects for which I think this might be helpful. At some point I'd love to learn from you or someone else more about setting up plugins from scratch. I've started the Academy course but have not yet finished. Is that the best way to learn how to create plugins?

Your plugin recipe to run the script from another recipe sounds fascinating.  I'd like to learn more when we have a moment.

I agree that process/flow reuse is an important area for DSS improvement. The features may already be available in the system. However, I think there are steps that could be taken to get this to a level of simplicity where plugin creation could be considered part of Everyday AI.

--Tom

Marlan

Hi Tom (@tgb417),

Ah yes, I remember seeing your posts about the Python issues with folders with large numbers of files. Sorry, I had forgotten about that. 

One other idea: what about using Python to create the shell-based recipes? You'd get the advantage of the shell-based recipes, but with a Python script managing them that could be put into a library and called from multiple other Python recipes. I'd think that the Python script would create the shell recipes, execute them, and then delete them when done. I've done something similar with SQL recipes. I could share that if it would be helpful.

On the topic of Plugins... I have gone through the Plugin development course (actually to give feedback when it was first released) and thought it was good. It definitely would have been helpful to me when I was first learning to develop plugins. So yes, seems like a good way to start. Looks like there are a couple of other plugin courses now that look to be helpful as well (all in the DSS Customization area).

Keep in mind that once you've learned to develop a type of plugin component (recipe, macro, etc.), you've learned... how to develop that type of plugin component. While there is some general crossover for sure, there is a lot that is unique to each type of component.

We started with plugin macros. These may be a bit easier to understand than some of the others (or maybe I'm just most familiar with them!) Plugin recipes can be super useful though and so worth figuring out how to develop those. We have also developed a couple of plugin scenario components. We have only two actual plugins (utility and production - the first only available on the dev instance and the second deployed to our production instance for use in projects in production). These two plugins contain various components - recipes, scenarios, and macros. We are using our plugins to contain various unrelated components rather than pieces of one solution which is how most of the plugins in the store are set up.

Happy to share more about the plugin recipe that runs another recipe. It has been really helpful. I hated trying to keep duplicate scripts in sync. Maybe we should create a post in the plugins area? There is also the new Inspiration option for sharing tips. Or maybe just direct message me...

Marlan

tgb417
Author

@Marlan 

Yeah, seeing a bit of the code you use to create other visual recipes from within Python would be interesting.

I'm finding this comment a bit discouraging.

Keep in mind that once you've learned to develop a type of plugin component
(recipe, macro, etc.), you've learned... how to develop that type of plugin
component. While there is some general crossover for sure, there is a lot
that is unique to each type of component.

As one of the organizers of the NYC Dataiku User Group, I'm wondering if I could invite you to share a bit more about your plugin journey with the community. Let me know what you think in a PM.

--Tom
Marlan

Tom (@tgb417):

Here is an excerpt from something I wrote a while back that shows creating a recipe, executing it, and finally deleting it (while it runs, you would see the additional recipe in the UI). I'm thinking something like this could be a function in the project library, and you'd call it from a Python recipe. One issue to work through is that the calling recipe and the created recipe can't share the output dataset. Note that I edited the excerpt a bit to make it a standalone example, but may have missed something.

Marlan

import dataiku
from dataikuapi.dss.recipe import SingleOutputRecipeCreator, SQLQueryRecipeCreator, CodeRecipeCreator

input_dataset_name = 'INPUT_DATASET'
input_table_name = 'INPUT_TABLE'
csv_dataset_name = 'CSV_DATASET'

client = dataiku.api_client()
project = client.get_project(dataiku.default_project_key())
project_variables = dataiku.get_custom_variables()


# Set SQL query recipe to populate the CSV file (creating recipe if needed)

# Recipe specs
csv_recipe_name = 'BUILD_' + csv_dataset_name
csv_recipe_query = "SELECT C1, C2, C3 FROM {0};".format(input_table_name)

try:
	csv_recipe = project.get_recipe(csv_recipe_name) # does not raise error if recipe doesn't exist
	set_recipe_payload(csv_recipe, csv_recipe_query)

except:
	# Create the recipe (assume exception was due to recipe not existing)
	builder = SingleOutputRecipeCreator('sql_query', csv_recipe_name, project)
	builder = builder.with_input(input_dataset_name)
	builder = builder.with_output(csv_dataset_name)
	csv_recipe = builder.build()

	set_recipe_payload(csv_recipe, csv_recipe_query)


# Run job to build CSV dataset
job_def = project.new_job_definition_builder()
job_def.with_type('NON_RECURSIVE_FORCED_BUILD')
job_def.with_output(csv_dataset_name)
project.start_job_and_wait(job_def.get_definition())

# Remove CSV file dataset and recipe
csv_recipe.delete()
csv_dataset = project.get_dataset(csv_dataset_name)
csv_dataset.clear() # removes folder and file
csv_dataset.delete()
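
The excerpt above calls a set_recipe_payload helper that didn't make it into the standalone version. Something along these lines, using the recipe settings API, should do the job (treat it as a sketch rather than the exact code I use):

def set_recipe_payload(recipe, query):
    # Store the SQL query as the recipe's payload and save it back to DSS
    settings = recipe.get_settings()
    settings.set_payload(query)
    settings.save()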

 

Turribeach

A plugin is the way to go for what you want. It takes some time, but it's not that hard. Give it a go.

With regards to this statement: "Dataiku Python library seems to be buggy for me when working with more than 100,000 files in a file system directory structure. "

There is no bug at all, just the overhead of a high-level language that is not optimised for low-level file system operations the way a Linux shell is. I also found issues processing large amounts of files using Python. In one use case I had 1 million XML files I needed to process, and Python would run like a dog. However, there are a few tricks you can use to prevent performance issues when handling lots of small files:

  1. Create subdirectories to "partition" the data (see the sketch just after this list). In our case even Google Cloud Storage was slow handling so many XMLs in a bucket. In the end we created subfolders for each date (YYYY-MM-DD), and now we have a few thousand files per subdirectory, which improves performance a lot.
  2. Use shell commands where needed; certain things are just way faster than Python. In another use case I needed to search inside lots of files to identify which files had relevant data we wanted to load. Nothing I could write in Python would be as fast as grep.
  3. Parallelise the flow by adding multiple execution paths. If your data files can be broken down into different types, use that. Otherwise you can easily use something like the last digit of the seconds part of the file creation date/time to be able to move files between different branches of your flow.
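
As a rough illustration of point 1 (the paths and the partitioning key are made up, adjust to your case), a small Python snippet can move an existing flat folder into per-date subfolders:

import os
import shutil
from datetime import datetime

src = "/data/xml_flat"          # flat folder with too many files (made-up path)
dst = "/data/xml_partitioned"   # destination root for YYYY-MM-DD subfolders

for name in os.listdir(src):
    src_path = os.path.join(src, name)
    if not os.path.isfile(src_path):
        continue
    # Partition by modification date so each subfolder holds only one day's files
    day = datetime.fromtimestamp(os.path.getmtime(src_path)).strftime("%Y-%m-%d")
    day_dir = os.path.join(dst, day)
    os.makedirs(day_dir, exist_ok=True)
    shutil.move(src_path, os.path.join(day_dir, name))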

In the end this is one of the beauties of Dataiku, you can seamlessly switch between Shell, SQL, Python or Visual recipes and use what's best for each step. 

Thanks

Christian

 

 
