Schema Configuration Issue with "Sampling" Recipe Using Python API

Guillaume5
Guillaume5 Registered Posts: 8

Hello everyone,

I am currently facing a problem using the Python API of Dataiku DSS to configure a "sampling" type recipe. My goal is to prevent the automatic conversion of the output schema when executing the recipe, specifically ensuring that data types like NUMERIC are not changed to FLOAT.

Here is the approach I have taken so far:

  1. I am using get_recipe_raw_definition() to obtain the current configuration of the parameters.
  2. I attempt to update engineParams, specifically sqlPipelineParams['overwriteOutputSchema'] = False.
  3. I apply the modifications using set_payload() and then save with recipe_settings.save().

However, I encounter the following error while using the Python API:

DataikuException: com.dataiku.common.server.DKUControllerBase$MalformedRequestException: Could not parse a SerializedRecipeAndPayload from request body, caused by: JsonSyntaxException...

It appears that the format of the payload I am sending is incorrect, and I can't seem to identify what needs to be adjusted to conform to the API's expectations.

I am open to any suggestions on:

  • The proper way to prevent schema modification using the Python API.
  • Recommended practices for configuring and saving engine parameters in "sampling" recipes.
  • Any documentation or resources specific to the Python API that could clarify the structure expected.

Thank you in advance for your assistance!

Guillaume

Operating system used: Windows

Answers

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,501 Neuron

    Please post your full code snippet.

  • Guillaume5
    Guillaume5 Registered Posts: 8
    import dataiku

    def configure_recipe_schema_update(recipe_name):
        """
        Configure the Dataiku recipe to prevent the conversion of output schema data types.

        Arguments:
        - recipe_name: Name of the recipe to modify.
        """
        # Connect to the Dataiku client and obtain the default project
        client = dataiku.api_client()
        project_key = dataiku.default_project_key()
        project = client.get_project(project_key)

        try:
            # Retrieve the recipe by its name
            recipe = project.get_recipe(recipe_name)
            recipe_settings = recipe.get_settings()

            # Retrieve the raw configuration for exploration
            recipe_definition = recipe_settings.get_recipe_raw_definition()

            # Ensure `overwriteOutputSchema` and `disableOutputSchemaAutoUpdate` are false
            params = recipe_definition['params']['engineParams']

            # Set the rule for the engine running partially in database
            params['sqlPipelineParams']['overwriteOutputSchema'] = False

            # Also ensure automatic update is disabled
            recipe_definition['neverRecomputeExistingPartitions'] = True
            # params['disableOutputSchemaAutoUpdate'] = True

            # Save and apply the modifications
            recipe_settings.save()

            print(f"Recipe '{recipe_name}' configuration updated to prevent conversion of output data types.")

        except Exception as e:
            print(f"Error when configuring the recipe '{recipe_name}': {e}")
  • FlorentD
    FlorentD Dataiker, Dataiku DSS Core Designer, Registered Posts: 27 Dataiker

    Hi,

    Some hints/remarks:

    • Can you tell us how you ended up with a NUMERIC data type? I don't see it among the default data types (maybe a custom type or meaning)?
    • The payload of a recipe is essentially its configuration. The output schema is not part of the recipe configuration itself.
    • To change the sqlPipelineParams, you would do:
    recipe = project.get_recipe("YOUR_RECIPE_NAME")
    settings = recipe.get_settings()
    settings.get_recipe_raw_definition().get('params').get('engineParams').get('sqlPipelineParams')['overwriteOutputSchema'] = False
    settings.save()
    

    But this won't help you.

    I would rather go with recipe.compute_schema_updates() (https://developer.dataiku.com/latest/api-reference/python/recipes.html#dataikuapi.dss.recipe.DSSRecipe.compute_schema_updates) and then apply them: https://developer.dataiku.com/latest/api-reference/python/recipes.html#dataikuapi.dss.recipe.RequiredSchemaUpdates.apply
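
    For illustration, a minimal sketch of that pattern (untested; the recipe name is a placeholder):

    import dataiku

    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())
    recipe = project.get_recipe("YOUR_RECIPE_NAME")

    # Ask DSS which schema changes the recipe would require, without applying them yet
    required = recipe.compute_schema_updates()
    if required.any_action_required():
        required.apply()  # only call this once the proposed changes are acceptable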

    Another way to achieve your goal is to save the output schema before running the recipe and put it back after running (if feasible).
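
    A rough sketch of that save-and-restore idea (untested; the dataset name is a placeholder, reusing the project handle from above):

    dataset = project.get_dataset("YOUR_OUTPUT_DATASET")
    saved_schema = dataset.get_schema()    # snapshot the schema before the run
    # ... run / build the recipe here ...
    dataset.set_schema(saved_schema)       # put the original schema back afterwards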

    I hope this helps.

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,501 Neuron

    Hi, I don't see any use of set_payload() in your code snippet. Furthermore, params['sqlPipelineParams'] doesn't have an 'overwriteOutputSchema' property (see below). Your code has all the markings of being generated via GenAI. Is that the case?

    (attached screenshot: image.png)
  • Guillaume5
    Guillaume5 Registered Posts: 8

    Yes, the code snippet was influenced by AI-generated suggestions as part of seeking optimized ways to handle Dataiku configurations. My primary aim is to ensure that the schema is not modified during recipe execution, particularly preventing automatic conversions from NUMERIC to FLOAT.

  • Alexandru
    Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,288 Dataiker

    Hi @Guillaume5 ,

    Thanks for the details. Indeed, if you want to avoid type conversion in a Python recipe, you can generally use infer_with_pandas=False and, where needed, use_nullable_integers:

    https://developer.dataiku.com/latest/api-reference/python/datasets.html#dataiku.Dataset.get_dataframe

    Another way to avoid having the schema modified is to write the schema yourself and then use write_dataframe instead of write_with_schema, as in the sketch below.
    There are no dataset-level parameters that avoid updating the schema.
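
    For example, a minimal in-recipe sketch (untested; dataset names are placeholders):

    import dataiku

    # Read without letting pandas re-infer the column types
    df = dataiku.Dataset("my_input").get_dataframe(infer_with_pandas=False)

    # write_dataframe() writes rows against the dataset's existing schema,
    # whereas write_with_schema() would overwrite that schema from the dataframe
    dataiku.Dataset("my_output").write_dataframe(df)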

    If you are not getting the desired results with infer_with_pandas=False or write_dataframe, please open a support ticket with an example recipe so we can review further, and please share the job diagnostics.


    Thanks

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,501 Neuron

    @Guillaume5 In the future I would recommend you always clarify that you are posting GenAI code when you ask for help in technical forums, as otherwise you end up with humans trying to debug GenAI code, which is fruitless. LLMs are prone to hallucinate and produce code that cannot work. Also, my advice is that if you want to learn the Dataiku API, you should code your calls by hand while reading the API documentation. That's the best way to learn the Dataiku Python API. While using LLMs may appear to be faster, you will not learn, and you will get stuck on hallucinations.

    Now to your issue. I think what you are asking for points to a lack of understanding of how Dataiku works, so I am going to expand on this to see if we can pinpoint your problem. As FlorentD has said, changing the sqlPipelineParams is useless for avoiding data type changes; SQL pipelines are a completely different feature that is not relevant to your problem. While Alexandru is correct in pointing out that you can control a Python recipe's schema output, you want to influence a Sampling recipe's output, not a Python one. I think your underlying issue is a lack of understanding of how managed datasets work. In general, any dataset Dataiku writes to is a managed dataset, and Dataiku has full control of it. In the case of SQL datasets, Dataiku handles the CREATE TABLE statement, defines the data types, and drops the table when needed (on schema changes or when the dataset is deleted). You cannot change this. The data types Dataiku decides to use are based on the internal Dataiku data types in the dataset itself (which you can control using a Prepare recipe and the formula language) and on what those internal Dataiku data types map to in each database technology (which, again, you can't change). Therefore what you want to do is not feasible for managed datasets.
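
    As an illustration, you can at least inspect the internal Dataiku types that drive this mapping (a sketch; the dataset name is a placeholder):

    import dataiku

    client = dataiku.api_client()
    project = client.get_project(dataiku.default_project_key())

    # These Dataiku-side column types determine the SQL types DSS generates
    schema = project.get_dataset("YOUR_DATASET").get_schema()
    for column in schema["columns"]:
        print(column["name"], column["type"])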

    So, having clarified this, we come to your issue. Why do you feel the need to ensure that data types like NUMERIC are not changed to FLOAT? What exactly are you trying to achieve? Are you writing to an external table, perhaps? Please explain your requirement, not how you think you can achieve it.
