dkuspark - load selected columns and not the entire DSS dataset
Hi,
I have a large DSS dataset (>1,000 columns) and only need a handful of those columns in a recipe. I would like to load just the columns I need rather than the entire dataset.
The default syntax, df = dkuspark.get_dataframe(sqlContext, "dss_dataset"), loads the entire thing.
Is there a syntax to limit the columns, similar to this simple SQL?
SELECT col1, col2, col3 FROM dataset
Best Answer
I have answered my own question. I'm sure it was obvious to any Spark pros, but I am fairly new to PySpark. My own Data Engineering team pointed out that Spark uses lazy evaluation: you can "load" the entire dataset, but no data is actually brought into memory until an action forces it to be computed.
Thus, loading the dataset into a PySpark dataframe and then running a select with just the columns I need means that only the selected data is actually retrieved.
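To make the pattern concrete, here is a minimal sketch of the load-then-select approach, assuming the dkuspark API shown in the question; load_selected_columns is a hypothetical helper name, and the dataset and column names are placeholders:

```python
def load_selected_columns(sql_context, dataset_name, columns):
    """Load a DSS dataset as a Spark dataframe and narrow it to `columns`.

    Because Spark evaluates lazily, get_dataframe() does not pull any data
    by itself; rows are only retrieved when an action (count, write,
    collect, ...) finally runs, and by then only the selected columns
    are needed.
    """
    # Imported inside the function so this sketch can be read outside a
    # DSS environment; in a real recipe these imports go at the top.
    import dataiku
    import dataiku.spark as dkuspark

    df = dkuspark.get_dataframe(sql_context, dataiku.Dataset(dataset_name))
    return df.select(*columns)  # still lazy: no data moves yet
```

In a recipe this might be called as subset = load_selected_columns(sqlContext, "dss_dataset", ["col1", "col2", "col3"]), after which subset behaves like any other PySpark dataframe.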
Answers
CoreyS (Dataiker Alumni):
Thank you for sharing your solution with the rest of the community, @chanbulgin!