
The python process died (killed - maybe out of memory?) + how to change it to SQL instead

ShrimpMania
Level 1

I was trying to filter datasets using Python inside DSS via Jupyter, but I noticed that when a dataset has many rows, the job stops and shows the error below. It processes one or two datasets and then stops.

 

The python process died(killed - maybe out of memory ?)

 

 I have some questions regarding this problem.

 

1. Is there a way to filter on 'CA' first and then get the data frame, so I can keep working in Python? Or is there a better way? I'm quite new to Dataiku.

2. How should I change it to SQL? Would that help avoid the error? I found that in Dataiku the recipe type is "sql_query", but I don't know what to do next; I just get errors along the lines of "SQLQueryRecipeCreator has no attribute ...".

import dataiku
import textwrap

client = dataiku.api_client()
project = client.get_project("my_project_key")

input_datasets = ["dataset1", "dataset2", ... "dataset122"]

for input_dataset_name in input_datasets:
    output_dataset_name = input_dataset_name + "CA_filtered"

    builder = project.new_recipe("python")
    builder.with_input(input_dataset_name)
    builder.with_new_output_dataset(output_dataset_name, "connection_name")

    # Dedent so the generated recipe script has no leading indentation,
    # which would otherwise raise an IndentationError when the recipe runs
    builder.with_script(textwrap.dedent("""\
        import dataiku
        from dataiku import recipe

        input_dataset = recipe.get_inputs_as_datasets()[0]
        output_dataset = recipe.get_outputs_as_datasets()[0]

        df = input_dataset.get_dataframe()
        df = df[df['CNTRY'] == 'CA']
        output_dataset.write_with_schema(df)
    """))

    recipe = builder.create()
    job = recipe.run()
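One way to address question 1 while staying in Python, sketched below and not tested against a live DSS instance: stream each dataset in chunks instead of calling get_dataframe() on the whole table, so the recipe never holds the full dataset in memory. The chunk size is a guess to tune, and the CNTRY/'CA' names come from the post.

```python
import textwrap

# Chunked replacement for the inner recipe script: read the first chunk to
# derive the output schema, then stream the remaining chunks through a writer.
FILTER_SCRIPT = textwrap.dedent("""\
    import dataiku
    from dataiku import recipe

    input_dataset = recipe.get_inputs_as_datasets()[0]
    output_dataset = recipe.get_outputs_as_datasets()[0]

    # chunksize is an assumption; tune it to the memory available
    chunks = input_dataset.iter_dataframes(chunksize=100000)
    first = next(chunks)
    filtered = first[first['CNTRY'] == 'CA']
    output_dataset.write_schema_from_dataframe(filtered)

    with output_dataset.get_writer() as writer:
        writer.write_dataframe(filtered)
        for chunk in chunks:
            writer.write_dataframe(chunk[chunk['CNTRY'] == 'CA'])
""")
```

In the loop from the post, this would be passed as builder.with_script(FILTER_SCRIPT) instead of the get_dataframe() script.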

 

6 Replies
Turribeach

What you need to do first is a visual Filter recipe with the Engine set to SQL (rather than DSS / internal), which will run in SQL and will not attempt to load all that data in memory before the filter is applied. Then use that as the input of your Python recipe.

ShrimpMania
Level 1
Author

Hi @Turribeach, thank you for your answer, but I don't quite follow. Could you please give a more detailed explanation? Do you mean I should create a visual Filter/Sampling recipe 122 times, setting the engine to SQL, and write the sampling code via Python?

All I need to do is filter and keep only the rows with the value "CA" (column: country). The datasets' schemas differ, so I can't use a Stack recipe.

I have all the datasets in a Snowflake DB and all of them are already imported into my project, but I need to store the outputs in a different Snowflake connection; that's why I wrote a connection name in my code.
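Since the data already lives in Snowflake, question 2 from the original post (a "sql_query" recipe) can be sketched with the same builder pattern as the Python recipe. This is an untested sketch: "my_project_key", "connection_name", and the dataset names are placeholders from the thread, and the FROM clause must reference the actual Snowflake table behind each dataset, which depends on the connection's table-naming settings.

```python
def ca_filter_sql(table_name: str) -> str:
    """Build the Snowflake query that keeps only the CA rows (CNTRY column from the post)."""
    return f'SELECT * FROM "{table_name}" WHERE "CNTRY" = \'CA\''

if __name__ == "__main__":
    # Only runs inside DSS, where the dataiku package is available
    import dataiku

    client = dataiku.api_client()
    project = client.get_project("my_project_key")

    for name in ["dataset1", "dataset2"]:  # hypothetical subset of the 122 datasets
        builder = project.new_recipe("sql_query")  # SQL code recipe instead of "python"
        builder.with_input(name)
        builder.with_new_output_dataset(name + "CA_filtered", "connection_name")
        # NOTE: passing the dataset name as the table name is an assumption;
        # check the real table name in the dataset's connection settings.
        builder.with_script(ca_filter_sql(name))
        recipe = builder.create()
        recipe.run()
```

Because the query runs in Snowflake, the filtering happens in the database and nothing is loaded into the DSS backend's memory.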


Do you mean I should do a visual Filter recipe 122 times, setting the engine to SQL?

Yes. I think you are getting confused here between a Filter recipe and sampling: these are two different things. Sampling is a way of showing you part of your dataset for exploration and for building other recipes; it has no impact on the output, and you don't need to sample anything to run a recipe or a full flow. So instead of Python recipes, use Filter recipes, which can run with the SQL engine in your Snowflake DB: the data will not be loaded in memory, and the filtered result will remain in Snowflake as a new table.

 

ShrimpMania
Level 1
Author

I know the difference; I wrote filter/sample because that's how they're labelled together in the visual recipes in Dataiku. I get your point.

 

I thought I could do it more efficiently in SQL from a Python Jupyter notebook, because it's very time-consuming to 1. click on Filter (visual recipes), 2. set the output dataset (name and storage), and 3. set the filter and run, 122 times over.

I don't use Snowflake directly, and all I can see are the blue-box datasets that were imported. I'm not sure if you're talking about Settings > Connection > Mode > SQL query. I'm also not sure that's the best option, because I'd prefer to keep the original datasets and filter them into new ones. But if repeating this 122 times is the only way to get it done properly, I'll try it, since the code I tried keeps stopping when the data is too big.


Of course you shouldn't do this manually! You can create a visual recipe using the Dataiku API. 

ShrimpMania
Level 1
Author

Meaning I can create the visual (filtering) recipe via Python in Jupyter? I tried to find documentation I could refer to, like I did for the code in my original post, but I couldn't. Could you please tell me how to do that?

 builder = project.new_recipe("python")

 

Can I make it work by revising some of my code, e.g. changing the "python" type to "sampling"? As far as I know, new_recipe requires a type, and in Dataiku the type name for the Sample/Filter recipe is "sampling".

 

I checked the Dataiku Developer Guide, but I couldn't find how to do this for the Filter recipe.
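A hedged sketch of what creating the Filter ("sampling" type, as found above) recipe through the API could look like. The payload keys below are assumptions, not documented fact: the reliable approach is to configure one Filter recipe in the UI, read back recipe.get_settings().get_json_payload(), and reuse that structure. The strval(...) DSS-formula expression and all names are placeholders from this thread.

```python
import copy

# Hypothetical payload template. Build ONE Filter recipe in the UI, dump
# recipe.get_settings().get_json_payload(), and paste the real structure here.
PAYLOAD_TEMPLATE = {
    "selection": {
        "filter": {
            "enabled": True,
            "expression": 'strval("CNTRY") == "CA"',  # DSS formula language (assumed syntax)
            "uiData": {"mode": "CUSTOM"},
        }
    }
}

def payload_for(expression: str) -> dict:
    """Return a fresh copy of the template with the given filter expression."""
    payload = copy.deepcopy(PAYLOAD_TEMPLATE)
    payload["selection"]["filter"]["expression"] = expression
    return payload

if __name__ == "__main__":
    # Only runs inside DSS, where the dataiku package is available
    import dataiku

    project = dataiku.api_client().get_project("my_project_key")
    for name in ["dataset1", "dataset2"]:  # hypothetical subset of the 122 datasets
        builder = project.new_recipe("sampling")  # the Sample/Filter visual recipe type
        builder.with_input(name)
        builder.with_new_output(name + "CA_filtered", "connection_name")
        recipe = builder.create()

        settings = recipe.get_settings()
        settings.set_json_payload(payload_for('strval("CNTRY") == "CA"'))
        settings.save()
        recipe.run()
```

Forcing the SQL engine may additionally require a field in the recipe settings (often an engine-type entry); that too is easiest to discover by inspecting the settings of a recipe configured once in the UI.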
