-
Conversion to Parquet fails in Hadoop HDFS
$ hadoop version
Hadoop 3.1.2
Source code repository https://github.com/apache/hadoop.git -r 1019dde65bcf12e05ef48ac71e84550d589e5d9a
Compiled by sunilg on 2019-01-29T01:39Z
Compiled with protoc 2.5.0
From source with checksum 64b8bdd4ca6e77cce75a93eb09ab2a9
This command was run using…
-
No connection defined to upload files/jars
I am trying to execute a PySpark recipe on a remote AWS EMR Spark cluster and I am getting: Your Spark settings don't define a temporary storage for yarn-cluster mode in act.compute_prepdataset1_NP: No connection defined to upload files/jars. I am using this runtime configuration: I also tried adding: spark.yarn.stagingDir…
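For reference, here is a minimal sketch of how spark.yarn.stagingDir would look when set on a session built by hand; the HDFS path and app name are assumptions, and in DSS the equivalent key normally goes into the recipe's Spark configuration (backed by an HDFS connection) rather than into the code itself:

```python
from pyspark.sql import SparkSession

# Sketch: point YARN at an HDFS staging directory so files/jars can be uploaded.
# The path below is a placeholder; a working HDFS connection must exist for it.
spark = (
    SparkSession.builder
    .appName("prepdataset1_NP")                                              # assumed app name
    .config("spark.yarn.stagingDir", "hdfs:///user/dataiku/.sparkStaging")   # hypothetical path
    .getOrCreate()
)
```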
-
How to save a Keras model from a Python recipe in a folder?
I would like to save a Keras model in a folder. I cannot figure out how to save the weights of my models because I cannot find the correct filepath. The code needed to achieve this is: model.save_weights(filepath) Even with this syntax: path = str(trained_LSTM_info['accessInfo']['root'])…
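One way to get a usable filepath is to resolve a managed folder to a local path; a minimal sketch, assuming the target folder lives on the local filesystem and using placeholder folder and file names:

```python
import os
import dataiku

# Sketch: resolve a local managed folder to a filesystem path and save the weights there.
# "lstm_models" and "lstm_weights.h5" are placeholder names.
folder = dataiku.Folder("lstm_models")
folder_path = folder.get_path()          # only works for folders on the local filesystem

weights_path = os.path.join(folder_path, "lstm_weights.h5")
model.save_weights(weights_path)         # `model` is the trained Keras model from the recipe
```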
-
HDFS - Force Parquet as default settings for recipe output
Greetings! I'm currently on a platform with Dataiku 11.3.1, writing datasets to HDFS. IT requires all datasets to be written in Parquet, but the default setting is CSV (Hive), which can generate errors. Is there a way to configure the connection to force the default format to be Parquet? Best regards,
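While the connection-level default is configured in the connection settings, existing datasets can also be switched programmatically; a sketch using the public API, where the project key, dataset names, and the exact format parameter names are assumptions:

```python
import dataiku

# Sketch: switch the storage format of existing HDFS datasets to Parquet.
# "MYPROJECT" and the dataset list are placeholders.
client = dataiku.api_client()
project = client.get_project("MYPROJECT")

for name in ["prepared_orders", "prepared_customers"]:            # hypothetical dataset names
    settings = project.get_dataset(name).get_settings()
    raw = settings.get_raw()
    raw["formatType"] = "parquet"                                  # assumed format key for HDFS datasets
    raw["formatParams"] = {"parquetCompressionMethod": "SNAPPY"}   # assumed parameter name
    settings.save()
```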
-
Enabling Parquet format in Dataiku DSS
Hi, Currently when we write into the Dataiku filesystem we only get CSV and Avro formats. How can I enable the Parquet format in Dataiku DSS running on a Linux platform on an EC2 instance? I need the steps for that. We also don't have any HDFS connection set up. Regards, Ankur.
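Until Parquet is available as a dataset format (which requires the Hadoop/Spark integration), one workaround is to write Parquet files into a managed folder from a Python recipe; a sketch assuming pyarrow is installed in the code env and using placeholder dataset and folder names:

```python
import os
import tempfile
import dataiku

# Sketch: read a dataset into pandas and push a Parquet file into a managed folder.
# "input_ds" and "parquet_out" are placeholder names; requires pyarrow in the code env.
df = dataiku.Dataset("input_ds").get_dataframe()
folder = dataiku.Folder("parquet_out")

with tempfile.TemporaryDirectory() as tmp:
    local_path = os.path.join(tmp, "data.parquet")
    df.to_parquet(local_path, engine="pyarrow")
    folder.upload_file("data.parquet", local_path)
```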
-
Permission Denied Installing Standalone Hadoop Integration
I am trying to install the standalone Hadoop integration for Dataiku. My Dataiku instance is hosted on a Linux server, and when I follow the directions for standalone installation here (Setting up Hadoop integration — Dataiku DSS 11 documentation), I get a permission denied error because it's treating the…
-
How to add a file to the Resources directory so that it is accessible at runtime
How can I quickly update the code environment, upload a zipped certificate file to the resources directory, and then make the certificate file accessible at runtime? I upload the file, modify the script so that an environment variable pointing to the folder is included, and grant the folder permissions. The path to the…
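A sketch of the runtime side, assuming the resources initialization script exports a variable such as MY_CERT_DIR pointing at the unzipped resources folder (the variable name, file name, and URL are all placeholders):

```python
import os
import requests

# Sketch: locate the certificate inside the resources directory via an environment
# variable set by the code env's resources initialization script.
cert_dir = os.environ["MY_CERT_DIR"]                   # hypothetical variable name
cert_path = os.path.join(cert_dir, "internal_ca.pem")  # hypothetical file name

# Example use: verify TLS against the uploaded CA bundle.
response = requests.get("https://internal.example.com/api", verify=cert_path)
print(response.status_code)
```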
-
Accessing Spark web UI
Hello, I am a beginner in Spark and I am trying to set up Spark on our Kubernetes cluster. The cluster is now working and I can run Spark jobs; however, I want to access the Spark web UI to inspect how my job is being distributed. We usually port-forward a port (4040), but I am not able to check which pod is the driver pod…
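One way to find the driver pod before port-forwarding, sketched with the Kubernetes Python client and relying on the spark-role=driver label that Spark on Kubernetes puts on driver pods (the namespace is a placeholder):

```python
from kubernetes import client, config

# Sketch: list running driver pods so you know which one to port-forward 4040 to.
config.load_kube_config()                  # or load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="spark-jobs",                # placeholder namespace
    label_selector="spark-role=driver",
)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
# Then: kubectl port-forward <driver-pod-name> 4040:4040
```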
-
PySpark exit recipe with Warning status
I have a PySpark recipe which reads a dataset and extracts a column based on the first index (first row). When the input dataset partition is empty, it throws a normal error: 'index out of range'. To handle this I created a try/except block and want to end the recipe in that except block. I tried sys.exit(1)…
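A sketch of that pattern in a DSS PySpark recipe, using placeholder dataset and column names; exiting with a non-zero code marks the activity as failed, so the except branch logs a warning and ends with sys.exit(0):

```python
import sys
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# "prepared_input" and "my_column" are placeholder names.
input_ds = dataiku.Dataset("prepared_input")
df = dkuspark.get_dataframe(sqlContext, input_ds)

try:
    # take(1) returns an empty list on an empty partition, so [0] raises IndexError
    first_value = df.take(1)[0]["my_column"]
except IndexError:
    print("WARNING: input partition is empty, nothing to extract")
    sys.exit(0)   # exit cleanly; a non-zero code would fail the job rather than warn
```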
-
NoClassDefFoundError when reading a Parquet file
I have set up an HDFS connection to access a Google Cloud Storage bucket on which I have Parquet files. After adding GoogleHadoopFileSystem to the Hadoop configuration I can access the bucket and files. However, when I create a new dataset and select a Parquet file (including a standard sample found at…