sample dataset with column unique values

DK
Level 1

Hello,

I would like to sample a dataset based on the unique values of a particular column. After selecting 100 random unique values from that column (the column has values that appear multiple times), I would like to load all the rows that contain any of these 100 values in that column.

Thank you in advance!


Operating system used: Windows 10

SarinaS
Dataiker

Hi @DK,

There are a couple of different approaches I can think of here. One would be to use visual recipes, though that would likely require a few steps. You could:
(1) Create a distinct recipe on the column in question 
(2) Create a filter recipe based on the results of the distinct recipe, with the sampling method set as in the screenshot below to select 100 random distinct values:

[Screenshot: filter recipe sampling settings selecting 100 random records]

(3) You would then create a Join recipe from your dataset of 100 distinct values to your original dataset, joining on the column in question, which will pull all records from the original dataset that match one of the 100 random values (the pandas sketch below shows what that join does).
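
If it helps to picture what the join in step (3) does, here is a small pandas sketch with toy stand-in dataframes (original_df and sampled_keys_df are hypothetical names for your original dataset and the 100-value dataset):

import pandas as pd

# toy stand-ins: original_df plays the role of your full dataset,
# sampled_keys_df the one-column dataset of randomly sampled distinct values
original_df = pd.DataFrame({'YOUR_COLUMN_NAME': ['a', 'b', 'a', 'c'], 'other': [1, 2, 3, 4]})
sampled_keys_df = pd.DataFrame({'YOUR_COLUMN_NAME': ['a', 'c']})

# the inner join keeps every row of original_df whose column value
# appears in sampled_keys_df -- here the three rows with 'a' or 'c'
result_df = original_df.merge(sampled_keys_df, on='YOUR_COLUMN_NAME', how='inner')
print(result_df)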

Another option would be to use a Python recipe, which would let you do all of these steps at once. For example, you could do something like this:

import dataiku
import random
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
input_dataset = dataiku.Dataset("<YOUR_INPUT_DATASET>")
input_df = input_dataset.get_dataframe()

# get all unique values from the column in question
unique_values = input_df['YOUR_COLUMN_NAME'].unique()
# randomly pick 100 of those unique values
final_column_list = random.sample(list(unique_values), 100)
# keep only the rows whose value in that column is one of the 100 random values
sampled_df = input_df[input_df['YOUR_COLUMN_NAME'].isin(final_column_list)]
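
To complete the recipe you would then write the result to the recipe's output; a minimal sketch, assuming an output dataset placeholder <YOUR_OUTPUT_DATASET> and the sampled_df variable from above:

# Write recipe output
output_dataset = dataiku.Dataset("<YOUR_OUTPUT_DATASET>")
output_dataset.write_with_schema(sampled_df)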


I hope that's helpful! 

Thanks,
Sarina

  

DK
Level 1
Author

Hi @SarinaS ,

Thank you for your response. What I meant in my question is whether there is such functionality without having to load the entire dataset with the "get_dataframe()" method, since the reason I want to sample the dataset is that it is too large in some cases. I was wondering if there is any parameter like "random-column-unique" or something to that effect when sampling the dataset...

Thank you in advance!

SarinaS
Dataiker

Hi @DK,

I believe that in all cases (including Python) you'll need to read in the entire dataset to sample the column. I would suggest creating a dataset of just the single column, so that you can do the random sampling on that column alone. A Join recipe that uses a random sample of 100 values from this single-column dataset as the left dataset will hopefully be the best way to reduce the data.
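
If you do end up using Python for the sampling step, here is a minimal sketch of the single-column idea (assuming placeholder names <YOUR_INPUT_DATASET>, <YOUR_KEYS_DATASET> and YOUR_COLUMN_NAME, and using the columns= argument of get_dataframe() so that only the one column is loaded into memory):

import dataiku
import random
import pandas as pd

# read only the one column we need to sample from
input_dataset = dataiku.Dataset("<YOUR_INPUT_DATASET>")
column_df = input_dataset.get_dataframe(columns=['YOUR_COLUMN_NAME'])

# pick 100 distinct values at random
unique_values = column_df['YOUR_COLUMN_NAME'].unique()
keys_df = pd.DataFrame({'YOUR_COLUMN_NAME': random.sample(list(unique_values), 100)})

# write the 100 values to a small dataset to use as the left side of the join
keys_dataset = dataiku.Dataset("<YOUR_KEYS_DATASET>")
keys_dataset.write_with_schema(keys_df)

You could then join this small dataset back to the original dataset with the Join recipe described above.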

Are you running into any particular errors as well? We can also talk about different potential storage options that could help, depending on what type of issue you are running into. 

Thanks,
Sarina 
