Targeted Random Sampling
Greetings! I have a data set with several columns, one of which is 'US State'; every state appears multiple times. I would like to draw a random sample containing 2 rows for each state. I've read up on the different sampling methods and don't see how they fit my use case. I welcome the Community's thoughts and direction. Thank you!
Operating system used: Windows
Best Answer
Alexandru Dataiker, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Dataiku DSS Adv Designer, Registered Posts: 1,226 Dataiker
Hi,
The sampling method under Sample -> Class rebalance (approx. nb. records) should roughly achieve this. However, there is no guarantee it will select exactly 2 rows from each state; instead, it ensures the sample includes rows from all states.
If you need exactly this type of sampling, you could use a Python recipe, applying pandas groupby / sample on the "state" column:
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.sample.html
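As a rough illustration of the groupby / sample approach, here is a minimal sketch. It assumes the column is named 'US State' as in the question; the helper name `sample_per_group`, the example rows, and the seed are made up for illustration. Note that `sample(n=2)` raises an error if any state has fewer than 2 rows, which should not happen here since every state appears multiple times.

```python
import pandas as pd


def sample_per_group(df, group_col="US State", n=2, seed=None):
    """Return n randomly chosen rows for each distinct value of group_col.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # DataFrameGroupBy.sample draws n rows independently within each group.
    return df.groupby(group_col).sample(n=n, random_state=seed)


# Made-up example data; in a real recipe this would be the input dataset.
df = pd.DataFrame(
    {
        "US State": ["Ohio"] * 3 + ["Texas"] * 4 + ["Utah"] * 2,
        "value": range(9),
    }
)

sampled = sample_per_group(df, seed=42)

# Count how many rows were drawn for each state.
counts = sampled["US State"].value_counts()
print(sampled)
print(counts)
```

Passing a fixed `seed` makes the sample reproducible between runs; leaving it as `None` gives a different sample each time.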
Answers
Thank you! I will give both a try.