Mentioning the column value in the if condition output along with the string in the prep recipe

Tsurapaneni
Tsurapaneni Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 41 ✭✭✭✭

Hi Team,

I have a use case where within the condition statement i have to mention the column value in between the string value. Giving an example:

if(and(B >1), 'B column is (column value from the below dataset (2)) times the A column', B)

The output of the dataset should look this way which is the "D" column value should be a mix of both string and the value from column B.

NOTE: Engine is "InSQL database" not the local stream. kindly consider this for the replies.

A. B. C. D

1. 2. str. B column is 2 times the A column

Thanks !

Answers

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron

    Hi @Tsurapaneni
    ,

    Not sure if I understand your use case completely, but maybe this is what you are looking for?:

    if(B > 1, concat('B column is ', B, ' times the A Column'), B)

    Hope it helps!

  • Tsurapaneni
    Tsurapaneni Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 41 ✭✭✭✭

    Hi @Ignacio_Toledo
    ,

    Thank you for this reply, it worked for me ! I have one more question on the count of the dss dataset. Is there any way I can get the count of the total number of rows in the dataset through a formula in the prep recipe ?

    For now I am using the group by recipe to get the count but I would like to create a column storing all the count of the rows in the dataset. Please let me know if we can achieve it through the formula processor in the prep recipe or by any other processors.

    NOTE: I am using the "Insql database" engine not the local one.

    Regards !

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron

    Happy to help @Tsurapaneni
    !

    Sadly I'm not aware of a method to add a column with the number of rows in the dataset using the insql database engine with the prepare recipe. Maybe someone else have a hint?

  • Tsurapaneni
    Tsurapaneni Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 41 ✭✭✭✭

    Hi @Ignacio_Toledo
    ,

    May be any way where we can specify the row value based on other column should also work for my use case for now. if at all the prep recipe wouldn't help me to add a column which has the count of the records in it.

    eg for my ask: consider there are 3 columns and should create a new column.

    A B C
    red 1 0
    blue 3 1

    I would like to retrieve the value from the column A whose column Value B is highest

    Steps like:

    1. Getting the highest value for the column 'B' (which is 3)

    2. Retrieving the value from the 'A' column whose Column B value is the highest (which is blue)

    Highest of B column is '3' and the concerned 'A' column is blue so the column D has the value 'blue' throughout it's column (as the max of B column value is 3 and to that row A column has the value 'blue').

    Please let me know if we can make this happen through any recipes in dataiku.

    Thanks !

  • AgatheG
    AgatheG Dataiker, Registered Posts: 37 Dataiker
    edited July 17

    Hi Tsurapaneni,

    If I understand your need correctly, an easy way to go would be via a Python recipe.

    For instance, you could append the column D like this:

    dataset = dataiku.Dataset(YOUR_DATASET_NAME)
    df = dataset.get_dataframe()
    
    i = df.B.idxmax()
    df["D"] = df.A[i]

    Please note that idxmax returns the index of the first row in your dataset where the column B has its highest value. So if the highest value in column B is present on multiple rows, the first value at column A where B is at its highest value will be taken to create column D.

    Meaning if you have:

    A B C
    red 1 0
    blue 3 1
    green 3 0

    Column D would be filled with 'blue' (not 'green'). Depending on your dataset, this may (or may not) matter.

    Hope this helps!

    Agathe

  • Tsurapaneni
    Tsurapaneni Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Registered Posts: 41 ✭✭✭✭

    Hi,

    Thank you for the response! I am looking for the solution using the prep recipe but not in the python code recipes.

Setup Info
    Tags
      Help me…