How do we sample data through code when using the get_dataframe() method in a PySpark recipe? Sampling is available in plain Python code via the sampling parameter, but I couldn't find a way to do it with a PySpark dataset.
df = dkuspark.get_dataframe()
Thanks,
Skanda
Hi @skandagn,
You can use the sample() method to get random sample records from the dataset.
Here is an example of how to use it in DSS:
# -*- coding: utf-8 -*-
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
# Read recipe inputs
input_dataset = dataiku.Dataset("crm_data__1_")
input_df = dkuspark.get_dataframe(sqlContext, input_dataset)

# Compute the output as a Spark DataFrame:
# sample roughly 10% of the rows, without replacement, with a fixed seed
output_df = input_df.sample(False, 0.1, seed=0)
# Write recipe outputs
output = dataiku.Dataset("output")
dkuspark.write_with_schema(output, output_df)