dkuspark - load selected columns and not the entire DSS dataset

Solved!
chanbulgin
Level 1
Hi:

I have a large DSS dataset (>1000 columns) and I need to load several columns for a recipe.  I would like to load just the columns I need and not the entire dataset.

The default syntax is df = dkuspark.get_dataframe(sqlContext, "dss_dataset") and this loads the entire thing.

Is there a syntax to limit the columns similar to this simple SQL:

SELECT col1, col2, col3 FROM dataset

???

1 Solution
chanbulgin
Level 1
Author

I have answered my own question. I'm sure it was obvious to any Spark pros, but I am fairly new to PySpark. My own Data Engineering team pointed out that Spark uses lazy evaluation: calling get_dataframe on the entire dataset only defines the DataFrame, and no data is brought into memory until an action is run on it.

Thus, loading the dataset into a PySpark DataFrame and then applying a select with just the columns I need means that only the selected columns should be retrieved when the job actually runs.
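
For anyone landing here later, the approach above can be sketched roughly as follows. This is a minimal sketch, not an official Dataiku snippet: the helper name load_selected_columns and the dataset and column names are hypothetical, and it assumes the usual PySpark recipe setup in which get_dataframe is given a dataiku.Dataset object.

```python
def load_selected_columns(dataset_name, columns):
    """Return a lazy Spark DataFrame restricted to `columns`.

    Hypothetical helper for illustration; not part of the dataiku API.
    """
    # Imports are deferred so the helper can be defined outside a DSS
    # PySpark recipe; inside a recipe they would normally sit at the top.
    import dataiku
    from dataiku import spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sql_context = SQLContext(sc)

    # Building the DataFrame only records a plan; no rows are read yet.
    dataset = dataiku.Dataset(dataset_name)
    df = dkuspark.get_dataframe(sql_context, dataset)

    # select() is a transformation, so it is lazy as well. Data is only
    # read once an action (show, count, write, ...) is triggered.
    return df.select(*columns)
```

In a recipe this might be called as subset = load_selected_columns("dss_dataset", ["col1", "col2", "col3"]), followed by subset.explain() to inspect the plan. Whether the unselected columns are physically skipped depends on the underlying storage: columnar formats such as Parquet can prune columns at read time, while row-oriented text formats such as CSV still have to be scanned in full even though only the selected columns are kept.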


2 Replies
CoreyS
Dataiker Alumni

Thank you for sharing your solution with the rest of the community @chanbulgin
