Pyspark - Read sampled data from dataset
Skanda Gurunathan
How do we sample data through code when using the get_dataframe method in PySpark? In plain Python code this is available through the sampling parameter, but I couldn't find an equivalent for a PySpark dataset.
df = dkuspark.get_dataframe()
Thanks,
Skanda
Answers
Hi @skandagn,
You could use the sample() method of the Spark DataFrame to get a random sample of records from the dataset.
Here is an example of how to use it in DSS:
# -*- coding: utf-8 -*-
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read recipe inputs
input_dataset = dataiku.Dataset("crm_data__1_")
input_df = dkuspark.get_dataframe(sqlContext, input_dataset)

# Compute the output as a Spark SQL DataFrame:
# sample roughly 10% of the rows, without replacement
output_df = input_df.sample(False, 0.1, seed=0)

# Write recipe outputs
output = dataiku.Dataset("output")
dkuspark.write_with_schema(output, output_df)
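Note that sample() takes a fraction rather than a row count, and Spark does not guarantee that the result contains exactly that fraction of the rows. If you want roughly a fixed number of rows instead, one possible approach is to derive the fraction from the row count and then cap the result with limit(). The helper below is only a sketch: the name fraction_for_target and the oversample padding factor are my own assumptions, not part of the Dataiku or Spark APIs.

```python
def fraction_for_target(total_rows, target_rows, oversample=1.2):
    """Return a fraction in [0.0, 1.0] suitable for DataFrame.sample().

    Hypothetical helper (not a Dataiku or Spark function). The fraction is
    padded by `oversample` because sample() keeps each row independently,
    so the number of rows returned is only approximately
    fraction * total_rows.
    """
    if total_rows <= 0 or target_rows <= 0:
        return 0.0
    return min(1.0, oversample * target_rows / total_rows)


# Possible usage inside the recipe above (needs a running Spark session):
# total = input_df.count()
# fraction = fraction_for_target(total, 1000)
# output_df = input_df.sample(False, fraction, seed=0).limit(1000)

print(fraction_for_target(100000, 1000))
```

Keep in mind that count() triggers a full pass over the dataset, so this only makes sense when an approximate row target matters more than the extra scan.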