sample dataset with column unique values

Options
DK
DK Registered Posts: 3

Hello,

I would like to sample the dataset based on the unique values from a particular column. So, after selecting 100 random unique values from a particular column (the column has values that appear multiple times), I would like to load all the rows that contain any of these 100 values in this column.

Thank you in advance!


Operating system used: Windows 10

Tagged:

Answers

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer Posts: 315 Dataiker
    edited July 17
    Options

    Hi @DK
    ,

    There are a couple different approaches I can think of that you could take here. One would be using visual recipes, but likely would require a couple different steps. You could:
    (1) Create a distinct recipe on the column in question
    (2) Create a filter recipe based off of the results of the distinct recipe with the sampling method set like so, to select a random 100 distinct values:

    Screenshot 2023-06-21 at 5.43.20 PM.png

    (3) You would then create a Join recipe from your distinct 100 values dataset to your original dataset, joining on the column question, which will pull all records from the original dataset that match with the 100 random values.

    Another option would be to use a Python recipe, which would allow you to do each of these in one step. For example, you could do something like this:

    import dataiku
    import pandas as pd, numpy as np
    from dataiku import pandasutils as pdu
    
    # Read recipe inputs
    Orders_enriched_prepared = dataiku.Dataset("<YOUR_INPUT_DATASET>")
    outputdistinct_df = Orders_enriched_prepared.get_dataframe()
    
    # get all unique values from the column in question 
    subset =  outputdistinct_df['YOUR_COLUMN_NAME'].unique()
    # gets a list of 100 numbers between 0 and the length of your subset array - 1
    randomlist = random.sample(range(0, len(subset) -1), 100)
    final_column_list = []
    # use each index to get random values from your list 
    for index in randomlist:
        final_column_list.append(subset[index])
    # now simply check if your column isin the list of 100 random values 
    outputdistinct_df = outputdistinct_df[outputdistinct_df['YOUR_COLUMN_NAME'].isin(final_column_list)]


    I hope that's helpful!

    Thanks,
    Sarina

  • DK
    DK Registered Posts: 3
    Options

    Hi @SarinaS
    ,

    Thank you for your response. In my question I meant whether there is such a functionality without having to load the entire dataset with the "get_dataframe()" method, since I would like to sample the dataset because it is too large in some cases. I was wondering if there as any parameter like "random-column-unique" or something to that effect when sampling the dataset...

    Thank you in advance!

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer Posts: 315 Dataiker
    Options

    Hi @DK
    ,

    I believe that in all cases (including Python) you'll need to read in the entire dataset to sample the column. I would suggest creating a dataset just of the single column, so that you can do random sampling on simply this column. Using a left join recipe based on a random sample of 100 values with this single-column dataset as your left-join dataset will hopefully be the best way to reduce the data.

    Are you running into any particular errors as well? We can also talk about different potential storage options that could help, depending on what type of issue you are running into.

    Thanks,
    Sarina

Setup Info
    Tags
      Help me…