The Python process died (killed - maybe out of memory?) + how to change it to SQL instead

ShrimpMania
ShrimpMania Registered Posts: 15

I was trying to filter datasets with Python inside DSS via Jupyter, but I noticed that when a dataset has many rows, the job stops and shows the error below. It runs one or two datasets and then stops.

The python process died(killed - maybe out of memory ?)

I have some questions regarding this problem.

1. Is there a way to filter on 'CA' first and then get the data frame, so I can keep working in Python? Or is there a better way? I'm quite new to Dataiku.

2. How should I change it to SQL? Would that help avoid the error? I found that in Dataiku the recipe type is "sql_query", but I don't know what to do next. I just get errors like SQLQueryRecipeCreator has no attribute (?).

import dataiku

client = dataiku.api_client()
project = client.get_project("my_project_key")
input_datasets = ["datast1", "dataset2", ... "dataset122"]

for input_dataset_name in input_datasets:
    output_dataset_name = input_dataset_name + "CA_filtered"
    builder = project.new_recipe("python")
    builder.with_input(input_dataset_name)
    builder.with_new_output_dataset(output_dataset_name, "connection_name")
    builder.with_script(f"""
import dataiku
from dataiku import recipe

input_dataset = recipe.get_inputs_as_datasets()[0]
output_dataset = recipe.get_outputs_as_datasets()[0]
df = input_dataset.get_dataframe()
df = df[df['CNTRY'] == 'CA']
output_dataset.write_with_schema(df)
""")
    recipe = builder.create()
    job = recipe.run()
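For question 2, a SQL code recipe would let Snowflake do the filtering so nothing is loaded into Python memory. This is only a rough sketch: it assumes the "sql_query" recipe creator accepts a script via with_script like the Python one does, and that the Snowflake table behind each dataset shares the dataset's name (the real table name is shown in the dataset's Settings tab, so adjust the FROM clause accordingly):

```python
def build_filter_sql(table_name, column="CNTRY", value="CA"):
    """Build a SELECT that keeps only rows where `column` equals `value`."""
    return 'SELECT * FROM "{}" WHERE "{}" = \'{}\''.format(table_name, column, value)


def create_sql_filter_recipe(project, dataset_name, connection):
    """Create one SQL query recipe (only runnable inside DSS).

    Assumes the Snowflake table is named like the dataset; check the
    dataset's Settings tab for the actual table name before relying on this.
    """
    builder = project.new_recipe("sql_query")
    builder.with_input(dataset_name)
    builder.with_new_output_dataset(dataset_name + "_CA_filtered", connection)
    builder.with_script(build_filter_sql(dataset_name))
    return builder.create()


# The generated SQL on its own, for inspection:
sql = build_filter_sql("dataset1")
```

Looping this over the 122 dataset names would replace the Python recipes one for one, with the filter running entirely in Snowflake.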

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,708 Neuron

    What you need to do first is create a visual Filter recipe and set the engine to SQL (rather than DSS / in-memory). It will then run in SQL and will not attempt to load all that data into memory before the filter is applied. Then use its output as the input of your Python recipe.

  • ShrimpMania
    ShrimpMania Registered Posts: 15

    Hi @Turribeach, thank you for your answer, but I don't get it. Could you please give a more detailed explanation? Do you mean I should do a visual Filter/Sampling recipe 122 times, setting the engine to SQL, and write the sampling code via Python?

    All I need to do is filter and keep only the rows having the "CA" value (column: country). The datasets' schemas are different, so I can't use a Stack recipe.

    I have all the datasets in a Snowflake DB and all of them are already imported into my project, but I need to store the outputs in a different Snowflake connection. That's why I wrote the connection name in my code.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,708 Neuron

    so you mean I should do a visual Filter * 122 times and setting the engine as SQL?

    Yes, I think you are getting confused between a Filter recipe and sampling. These are two different things. Sampling is a way of showing you part of your dataset for exploration while you build other recipes; it has no impact on the output. You don't need to sample anything to run a recipe or a full flow. So instead of Python recipes, use Filter recipes, which you can run with the SQL engine in your Snowflake DB: the data will not be loaded into memory and will be written, already filtered, to a new table in Snowflake.

  • ShrimpMania
    ShrimpMania Registered Posts: 15

    I know the difference, but I wrote "filter/sample" because they are shown together in the visual recipe in Dataiku. I get your point.

    I thought I could do it more efficiently in SQL via Python in Jupyter, as it's so time-consuming to 1. click on Filter (visual recipes), 2. set the output dataset (name and "store into"), 3. set the filter and run, 122 times.

    I don't use Snowflake directly, and all I can see are the blue-box datasets that were imported. I'm not sure if you're talking about Settings - Connection - Mode - SQL query. I'm also not sure that's the best approach, because I prefer to keep the original datasets and filter into new ones. But if repeating this 122 times is the only way to do it properly, I would try it, since the code I tried keeps stopping when the data is too big.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,708 Neuron

    Of course you shouldn't do this manually! You can create a visual recipe using the Dataiku API.
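    The suggestion above can be sketched roughly as follows. This is a sketch, not a tested implementation: the thread confirms the visual Filter recipe's type id is "sampling", but the exact shape of the filter JSON (params["filter"]) and the engine field name ("engineType") are assumptions to verify. The safest route is to create one Filter recipe by hand in the UI, read it back with recipe.get_settings().get_recipe_raw_definition(), and copy the exact structure you see there.

    ```python
    def build_ca_filter(column="CNTRY", value="CA"):
        """Filter-condition payload: keep only rows where `column` == `value`.

        The structure is an assumption; compare it against a hand-built
        Filter recipe's raw definition before using it.
        """
        return {
            "enabled": True,
            "uiData": {
                "mode": "&&",  # AND all conditions together
                "conditions": [
                    {"input": column, "operator": "EQ", "string": value},
                ],
            },
        }


    def create_filter_recipes(project, input_datasets, connection):
        """Create one visual Filter recipe per dataset (runs inside DSS only)."""
        for name in input_datasets:
            builder = project.new_recipe("sampling")  # type id of the Filter recipe
            builder.with_input(name)
            builder.with_new_output(name + "_CA_filtered", connection)
            recipe = builder.create()

            settings = recipe.get_settings()
            params = settings.get_recipe_raw_definition()["params"]
            params["filter"] = build_ca_filter()
            params["engineType"] = "SQL"  # push the filter down to Snowflake
            settings.save()
            recipe.run()


    # The condition payload on its own, for inspection:
    ca_filter = build_ca_filter()
    ```

    Passing the same list of 122 dataset names as in the original script would then create and run all the Filter recipes in one loop.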

  • ShrimpMania
    ShrimpMania Registered Posts: 15

    Meaning I can create the visual (filtering) recipe via Python in Jupyter? I tried to find documentation to refer to, like I did for the code I provided in the post, but I couldn't. Could you please tell me how to do that?

     builder = project.new_recipe("python")

    Can I do it by revising some of my code, e.g. changing "python" to "sampling"? As far as I know, new_recipe takes a recipe type, and in Dataiku the type name for the sample/filter recipe is "sampling".

    I checked the Dataiku Developer Guide, but I couldn't find how to do it for the Filter recipe.
