dkuspark - load selected columns and not the entire DSS dataset

chanbulgin Registered Posts: 2


I have a large DSS dataset (>1000 columns) and only need a handful of those columns in a recipe. I would like to load just the columns I need rather than the entire dataset.

The default syntax is df = dkuspark.get_dataframe(sqlContext, "dss_dataset") and this loads the entire thing.

Is there a syntax to limit the columns similar to this simple SQL:

SELECT col1, col2, col3 FROM dataset



Best Answer

  • chanbulgin Registered Posts: 2
    Answer ✓

    I have answered my own question. I'm sure it was obvious to any Spark pros, but I am fairly new to pySpark. My own Data Engineering team pointed out that Spark uses lazy evaluation: you can "load" the entire dataset, but no data is actually brought into memory until an action forces it to be computed.

    Thus, loading the data into a pySpark DataFrame and then running a select for just the columns I need means that only the selected data is actually retrieved.
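    The pattern above can be sketched like this. It uses plain local pySpark for illustration (in DSS the DataFrame would instead come from dkuspark.get_dataframe(sqlContext, dataset)); the column names and sample rows are placeholders:

    ```python
    # Minimal sketch of lazy evaluation + column selection in pySpark.
    # In a DSS recipe you would obtain `df` with:
    #   import dataiku, dataiku.spark as dkuspark
    #   df = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("dss_dataset"))
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("lazy-select").getOrCreate()

    # Stand-in for the wide DSS dataset (placeholder data)
    df = spark.createDataFrame(
        [(1, "a", 0.5, "x"), (2, "b", 1.5, "y")],
        ["col1", "col2", "col3", "col4"],
    )

    # select() is a transformation: it only narrows the logical plan,
    # nothing is read or computed yet
    df_small = df.select("col1", "col2", "col3")

    # collect() is an action: only now is the (narrowed) data materialized
    rows = df_small.collect()
    print(df_small.columns)  # ['col1', 'col2', 'col3']
    ```

    Because the select happens before any action, Spark's optimizer can push the column pruning down to the read itself, so for columnar or SQL-backed datasets only the selected columns are actually fetched (how much is pruned at the storage layer depends on the dataset's backend).
    
    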

