dkuspark - load selected columns and not the entire DSS dataset
Hi,
I have a large DSS dataset (>1,000 columns) and only need a handful of those columns in a recipe. I would like to load just the columns I need rather than the entire dataset.
The default syntax, df = dkuspark.get_dataframe(sqlContext, "dss_dataset"), loads the entire thing.
Is there a syntax to limit the columns, similar to this simple SQL?
SELECT col1, col2, col3 FROM dataset
Best Answer
I have answered my own question. I'm sure it was obvious to any Spark pros, but I am fairly new to PySpark. My own Data Engineering team pointed out that Spark uses lazy evaluation: you can "load" the entire dataset, but no data is actually brought into memory until an action forces it to be computed.
Thus, loading the dataset into a PySpark dataframe and then running a select with just the columns I need means that only the selected data is actually retrieved.
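To make the pattern concrete, here is a minimal sketch of the load-then-select approach, assuming the dkuspark API shown in the question; load_selected_columns is a hypothetical helper name, and the dataset and column names are placeholders:

```python
def load_selected_columns(sql_context, dataset_name, columns):
    """Load a DSS dataset as a Spark dataframe and narrow it to `columns`.

    Because Spark evaluates lazily, get_dataframe() does not pull any data
    by itself; rows are only retrieved when an action (count, write,
    collect, ...) finally runs, and by then only the selected columns
    are needed.
    """
    # Imported inside the function so this sketch can be read outside a
    # DSS environment; in a real recipe these imports go at the top.
    import dataiku
    import dataiku.spark as dkuspark

    df = dkuspark.get_dataframe(sql_context, dataiku.Dataset(dataset_name))
    return df.select(*columns)  # still lazy: no data moves yet
```

In a recipe this might be called as subset = load_selected_columns(sqlContext, "dss_dataset", ["col1", "col2", "col3"]), after which subset behaves like any other PySpark dataframe.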
Answers
CoreyS (Dataiker Alumni):
Thank you for sharing your solution with the rest of the community, @chanbulgin!