Unable to read text files using PySpark in Dataiku
Hi, I'm trying to read text files from a managed folder using PySpark in Dataiku. I create an RDD, but when I call collect() on it, it throws an error saying that the path doesn't exist. Below is the code:
# -*- coding: utf-8 -*-
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from datetime import datetime
import io
import pandas as pd
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
#Initialize SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
# Read the managed folder
folder = dataiku.Folder("abcftdgh")
files = folder.list_paths_in_partition()
#print(files)
# Filter the .txt files
txt_files = [file for file in files if file.endswith(".txt")]
#print(txt_files )
# Iterate each .txt files
for file in txt_files:
name= file[1:8]
timestamp_str = datetime.strptime(file[9:23], "%Y%m%d%H%M%S") # Parse the timestamp string
timestamp = timestamp_str.strftime("%Y-%m-%dT%H:%M:%S.%f+0000")
print(file)
print("Name:",name)
print("Timestamp:",timestamp)
#Read the text file into an RDD
lines = sc.textFile(file)
print(lines)
llist = lines.collect()
for line in llist:
print(line)
This code prints each file's path along with the columns derived from the file name, i.e. name and timestamp. If I only print lines, it prints "File path MapPartitionsRDD[145] at textFile at <unknown>:0", and it throws a path-not-found error if I use collect(). Please suggest a solution for reading text files from a managed folder and creating RDDs and DataFrames from the data. Your assistance would be really helpful. Thank you!
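A likely cause, stated as an assumption rather than a confirmed diagnosis: list_paths_in_partition() returns paths relative to the managed folder's root, so sc.textFile() resolves them against Spark's default filesystem, where no such file exists. Because textFile() is lazy, the failure only shows up when collect() runs. One possible workaround is to read each file through the folder's own get_download_stream() and hand the lines to sc.parallelize(). The sketch below shows that idea; read_text_lines, lines_to_rdd and InMemoryFolder are hypothetical names introduced for illustration, and InMemoryFolder is only a stand-in so the example runs outside DSS.

```python
import io


def read_text_lines(folder, path, encoding="utf-8"):
    """Return the lines of one managed-folder file as a list of strings.

    `folder` is expected to behave like dataiku.Folder; only its
    get_download_stream(path) method is used here.
    """
    with folder.get_download_stream(path) as stream:
        return stream.read().decode(encoding).splitlines()


def lines_to_rdd(sc, lines):
    """Distribute already-loaded lines as an RDD (sc is a SparkContext)."""
    return sc.parallelize(lines)


class InMemoryFolder:
    """Hypothetical stand-in for dataiku.Folder, for running outside DSS."""

    def __init__(self, files):
        self._files = files  # maps folder-relative path -> file content as bytes

    def list_paths_in_partition(self):
        return sorted(self._files)

    def get_download_stream(self, path):
        return io.BytesIO(self._files[path])


if __name__ == "__main__":
    # Made-up file name and content, shaped like the names parsed in the post.
    demo = InMemoryFolder(
        {"/ABCDEFG_20230115103000.txt": b"first line\nsecond line\n"}
    )
    for path in demo.list_paths_in_partition():
        if path.endswith(".txt"):
            print(path, read_text_lines(demo, path))
```

Inside DSS, the same helpers would presumably be called with folder = dataiku.Folder("abcftdgh") and sc = SparkContext.getOrCreate(), for example rdd = lines_to_rdd(sc, read_text_lines(folder, path)). This loads each file on the driver first, so it only suits files that fit in driver memory.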
Answers
Turribeach
Hi, can you please edit your post and repost your code using a code block (the </> icon in the toolbar)? Indentation is mandatory in Python, so anyone trying to reproduce your issue can't use the code as posted, because all of the indentation is gone.
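For reference, here is a guess at how the file-name parsing part of the loop was meant to be indented, assuming every statement after the for line belongs to the loop body. The file list is made up for illustration (in the post it comes from folder.list_paths_in_partition()), and the parsed list is an addition so the result can be inspected. The Spark calls are left out so the snippet runs on its own.

```python
from datetime import datetime

# Hypothetical file list, shaped so that the slices used in the post line up:
# characters 1-7 are the name and characters 9-22 are the timestamp.
files = ["/ABCDEFG_20230115103000.txt", "/notes.csv"]

# Keep only the .txt files
txt_files = [file for file in files if file.endswith(".txt")]

parsed = []
for file in txt_files:
    # Assumed to be inside the loop body (indentation was lost in the post)
    name = file[1:8]
    parsed_time = datetime.strptime(file[9:23], "%Y%m%d%H%M%S")
    timestamp = parsed_time.strftime("%Y-%m-%dT%H:%M:%S.%f+0000")
    parsed.append((file, name, timestamp))
    print(file)
    print("Name:", name)
    print("Timestamp:", timestamp)
```

If the real file names have a different layout, the slice offsets would need adjusting; posting the code in a code block would remove the guesswork.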