How to read parquet file from GCS using pyspark?

Solved!
Chiktika
Level 3

Hi,

My parquet files, stored in GCS, were written with a Parquet version that is too recent to be read by DSS through a GCS managed dataset.

So I am trying to read them via Spark and save them to another dataset.

Doing this with local files is very easy, but how do I do it with files stored in GCS?

folder = dataiku.Folder("SpTdwpr2")
path = folder.get_path()  # note: get_path() only works for filesystem-hosted managed folders

df = sqlContext.read.parquet(f'{path}/test_parquet.parquet')

 

With many thanks for your help.

C.

3 Replies
SarinaS
Dataiker

Hi @Chiktika ,

I'll walk through a setup that worked for me, and hopefully that will help.  

Here's a bucket I have in GCS that contains a parquet file:

[Screenshot: GCS bucket listing showing the parquet file]

I created a managed folder that points to this bucket with the following settings:

[Screenshot: managed folder settings pointing to the GCS bucket]

Here are a couple of options for using sqlContext.read.parquet to read the parquet files in this folder. The first example gets the filenames from the bucket one by one; each printed filename can then be used directly, like so: sqlContext.read.parquet('gs://sarina-bucket/dataiku/DKU_HAIKU_STARTER/gcp_parquet_file/part-r-00000.snappy.parquet'). The second example reads the whole directory with a wildcard: sqlContext.read.parquet('gs://sarina-bucket/dataiku/*/*/*.parquet'). These paths could also be generated programmatically from the folder's get_info() and list_paths_in_partition() functions (a rough sketch follows the screenshot below).

[Screenshot: notebook code listing the files in the managed folder and reading them with sqlContext.read.parquet]
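
For readers who cannot see the screenshot, here is a minimal sketch of that notebook code. The folder ID, bucket name, and paths are illustrative placeholders, sqlContext is assumed to already exist in the notebook, and the exact keys returned by get_info() (here accessInfo/bucket/root) may differ by DSS version, so treat this as a starting point rather than exact code:

import dataiku

# Managed folder pointing at the GCS bucket (ID "SpTdwpr2" reused from the question)
folder = dataiku.Folder("SpTdwpr2")

# Option 1: list the files one by one and read each by its gs:// URI
info = folder.get_info()
access = info.get("accessInfo", {})           # assumption: cloud folders expose bucket/root here
bucket = access.get("bucket")
root = access.get("root", "")
for file_path in folder.list_paths_in_partition():
    uri = f"gs://{bucket}{root}{file_path}"   # file_path is relative to the folder root
    print(uri)
    df = sqlContext.read.parquet(uri)

# Option 2: read the whole directory at once with a wildcard
df = sqlContext.read.parquet(f"gs://{bucket}/dataiku/*/*/*.parquet")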

And then to write this to a dataset:

[Screenshot: notebook code writing the Spark dataframe to a managed dataset]
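
For completeness, a minimal sketch of that write step, assuming an output dataset already exists in the Flow (the name "gcp_parquet_output" is illustrative); dkuspark.write_with_schema() is Dataiku's standard PySpark helper for writing a Spark dataframe to a dataset:

import dataiku
import dataiku.spark as dkuspark

# Output dataset: the name is illustrative and the dataset must already exist in the Flow
output_dataset = dataiku.Dataset("gcp_parquet_output")

# Write the Spark dataframe and let DSS set the dataset schema from it
dkuspark.write_with_schema(output_dataset, df)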

I'm not sure if this addresses your use case, so please feel free to add any details if it does not.

 

Thanks,

Sarina 

 

Chiktika
Level 3
Author

Hi @SarinaS,

 

That's exactly what I did and it works perfectly.

Many thanks.

 

Chiktika
Level 3
Author

In fact, I did not do exactly the same thing: I did not create a managed folder, but instead read directly from my GCS bucket.

from google.cloud import storage  # bucket_name and bucket_path are set elsewhere

storage_client = storage.Client()
bucket = storage_client.lookup_bucket(bucket_name)
blobs = bucket.list_blobs(prefix=bucket_path)
for blob in blobs:
    df = sqlContext.read.parquet(f"gs://{bucket_name}/{blob.name}")