Pyspark - Read sampled data from dataset

skandagn

How can we read only a sample of a dataset in PySpark code when using the get_dataframe method? In plain Python code this is available through the sampling parameter, but I couldn't find an equivalent for a PySpark dataset.

 

df = dkuspark.get_dataframe()

 

Thanks,

Skanda

 

CatalinaS
Dataiker

Hi @skandagn,

You can use the DataFrame method sample() to get a random sample of records from the dataset.

Here is an example of how to use it in DSS:

# -*- coding: utf-8 -*-
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Read recipe inputs
input_dataset = dataiku.Dataset("crm_data__1_")  # renamed to avoid shadowing the built-in input()
input_df = dkuspark.get_dataframe(sqlContext, input_dataset)

# Compute the output as a Spark SQL DataFrame
output_df = input_df.sample(False, 0.1, seed=0)  # sample roughly 10% of the rows, without replacement

# Write recipe outputs
output = dataiku.Dataset("output")
dkuspark.write_with_schema(output, output_df)
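
One note on the 0.1 fraction: to my understanding, sample() without replacement keeps each row independently with probability equal to the fraction, so the output holds roughly 10% of the rows rather than an exact count. The plain-Python sketch below only imitates that behaviour to show the effect. bernoulli_sample is a hypothetical helper written for this illustration; it is not part of Spark or DSS, and its random numbers will not match Spark's.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    Hypothetical helper that mimics sampling without replacement;
    it is an illustration only, not Spark's implementation.
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(10_000))
sampled = bernoulli_sample(rows, fraction=0.1, seed=0)

# The size is close to 10% of the input, but usually not exactly 1000.
print(len(sampled))
```

If you need an exact number of rows instead of an approximate fraction, calling limit(n) on the sampled DataFrame is one option to consider.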

 
