dkuspark - load selected columns and not the entire DSS dataset

chanbulgin
chanbulgin Registered Posts: 2

Hi:

I have a large DSS dataset (>1000 columns) and I only need a few of those columns in a recipe. I would like to load just the columns I need rather than the entire dataset.

The default syntax is df = dkuspark.get_dataframe(sqlContext, "dss_dataset") and this loads the entire thing.

Is there a syntax to limit the columns similar to this simple SQL:

SELECT col1, col2, col3 FROM dataset

???


Best Answer

  • chanbulgin
    chanbulgin Registered Posts: 2
    Answer ✓

    I have answered my own question. I'm sure it was obvious to any Spark pro, but I am fairly new to PySpark. My own Data Engineering team pointed out that Spark uses a lazy evaluation model: we can "load" the entire dataset, but no data is actually brought into memory until an action forces Spark to act on it.

    Thus, loading the dataset into a PySpark DataFrame and then running a select with just the columns I need means that only that data is actually retrieved — Spark's optimizer pushes the projection down to the read.
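    A minimal sketch of what that looks like in a DSS PySpark recipe. The dataset name "dss_dataset" and the column names col1–col3 are placeholders from the question; this assumes the usual DSS pattern of wrapping the dataset in dataiku.Dataset before handing it to dkuspark.get_dataframe. select() is standard PySpark, and nothing is read until an action (like show or a write) runs.

    ```python
    import dataiku
    import dataiku.spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Lazily "load" the whole dataset -- no data is read at this point
    dataset = dataiku.Dataset("dss_dataset")  # placeholder name
    df = dkuspark.get_dataframe(sqlContext, dataset)

    # Narrow to the needed columns; Spark prunes the rest at read time
    df_small = df.select("col1", "col2", "col3")

    # Data is only actually retrieved when an action runs, e.g.:
    df_small.show(5)
    ```

    For columnar storage formats in particular, the projection pushdown means the unselected columns are never read off disk at all.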


