Hi,
I have a large DSS dataset (over 1,000 columns) and I need only a handful of columns for a recipe. I would like to load just the columns I need rather than the entire dataset.
The default syntax, df = dkuspark.get_dataframe(sqlContext, "dss_dataset"), loads the entire thing.
Is there a syntax to limit the columns, similar to this simple SQL?
SELECT col1, col2, col3 FROM dataset
I have answered my own question. I'm sure it was obvious to any Spark pros, but I am fairly new to PySpark. My Data Engineering team pointed out that Spark uses lazy evaluation: you can "load" the entire dataset, but no data is actually brought into memory until an action forces it to be computed.
Thus, loading the dataset into a PySpark DataFrame and then running a select() with just the columns I need means that only the selected data is actually retrieved.
Thank you for sharing your solution with the rest of the community @chanbulgin!