Pyspark - Read sampled data from dataset

skandagn Registered Posts: 8 ✭✭✭

How do we sample data in code with the get_dataframe method in PySpark? Sampling is available in Python code via the sampling parameter, but I couldn't find an equivalent for a PySpark dataset.

df = dkuspark.get_dataframe()

Thanks,

Skanda

Answers

  • Catalina
    Catalina Dataiker, Dataiku DSS Core Designer, Registered Posts: 135 Dataiker
    edited July 17

    Hi @skandagn,

    You can use the sample() method to draw a random sample of records from the dataset.

    Here is an example of how to use it in DSS:

    # -*- coding: utf-8 -*-
    import dataiku
    from dataiku import spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    
    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)
    
    # Read recipe inputs
    # (named input_dataset to avoid shadowing Python's built-in input)
    input_dataset = dataiku.Dataset("crm_data__1_")
    input_df = dkuspark.get_dataframe(sqlContext, input_dataset)
    
    # Compute the output as a Spark SQL DataFrame
    # sample(withReplacement, fraction, seed): samples ~10% of the rows
    output_df = input_df.sample(False, 0.1, seed=0)
    
    # Write recipe outputs
    output = dataiku.Dataset("output")
    dkuspark.write_with_schema(output, output_df)
