-
Spark Cluster mode
Hello, as we use Spark heavily, we are running into slow application launches in YARN cluster mode. The slowness comes from the many DSS-related files and jar files that have to be uploaded for every single Spark application. We checked the feature of using cluster mode. However, we know that…
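One common mitigation for the repeated jar uploads, independent of DSS, is to pre-stage the Spark jars on HDFS once and point spark.yarn.archive at them so they are not re-uploaded on every submission. A minimal sketch, with a hypothetical HDFS path:

```python
# Sketch: stop YARN from re-uploading Spark jars on every submit by staging
# them once on HDFS (e.g. `hdfs dfs -put spark-jars.zip /libs/`) and pointing
# the standard spark.yarn.archive setting at that archive. The path below is
# an assumption for illustration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("yarn-jar-staging-sketch")
    .config("spark.yarn.archive", "hdfs:///libs/spark-jars.zip")
    .getOrCreate()
)
```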
-
Spark Setup
Hi all, I need to use SparkSQL and Spark for Python. I installed Spark and it is shown in the administration settings, but when I run the SparkSQL recipe it raised this error: Cannot run program "spark-submit" (in directory "/data/design/jobs/DC Can anyone help, or send an article to follow for the configuration? Thanks in…
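This error usually means spark-submit is not on the PATH of the process running DSS. A quick diagnostic sketch you could run from any Python recipe or notebook (exact install locations vary, so the comment's suggested fix is an assumption):

```python
# Sketch: check whether spark-submit is visible to the DSS process, the usual
# cause of 'Cannot run program "spark-submit"'.
import os
import shutil

path = shutil.which("spark-submit")
if path is None:
    # Likely fix: add your Spark installation's bin/ directory to PATH and
    # re-run the DSS Spark integration setup. Locations vary by install.
    print("spark-submit not found; PATH =", os.environ.get("PATH"))
else:
    print("spark-submit found at", path)
```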
-
Access the partition value in Pyspark Recipe
I have a table that is partitioned by date. How can I access the partition date in a PySpark recipe? I tried the following code, but it does not recognize actual_date: fct_pm_card.select("application_id", "product").filter(col('actual_date') <= end_date)
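For context: in a DSS PySpark recipe the partition being built is exposed as a flow variable rather than as a dataframe column, so it can be re-attached with lit(). A sketch, assuming a time-based "date" dimension (inspect dataiku.dku_flow_variables in your recipe for the actual key names):

```python
# Sketch: fetch the partition value from DSS flow variables and re-attach it
# as a column, since partition columns are not stored in the data itself.
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import lit

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# "DKU_DST_DATE" assumes a time-based partition dimension; the available
# keys depend on how the dataset is partitioned.
actual_date = dataiku.dku_flow_variables["DKU_DST_DATE"]

# "fct_pm_card" is the dataset name from the question.
fct_pm_card = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("fct_pm_card"))
df = fct_pm_card.withColumn("actual_date", lit(actual_date))
```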
-
Use Custom UDFs on Visual Recipes
Hello Dataikers! Since all visual recipes are based on SparkSQL, some "advanced" aggregations aren't available. In my case, I have 3 values in 3 columns: A, B, and C, and I just want to compute their median. The problem is that a median function doesn't exist in my current Spark backend version, so I need to use a UDF to do…
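As a sketch of the UDF route in a PySpark recipe (rather than a visual recipe): for exactly three values the median is simply the middle element after sorting, so a small Python UDF suffices. Column names A, B, C follow the question:

```python
# Sketch: median of three columns as a Python UDF. For three values the
# median is the middle element after sorting.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

@udf(returnType=DoubleType())
def median3(a, b, c):
    # Minimal null handling for the sketch.
    if a is None or b is None or c is None:
        return None
    return float(sorted([a, b, c])[1])

# Toy data standing in for the real dataset.
df = spark.createDataFrame([(1.0, 5.0, 3.0), (2.0, 2.0, 8.0)], ["A", "B", "C"])
df.withColumn("median", median3(col("A"), col("B"), col("C"))).show()
```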
-
Pyspark Code to Spark Dataframe
I am new to Dataiku. I have been going up and down the Dataiku documentation trying to make sense of it, especially the Dataiku PySpark code recipe for my project, but I cannot find anything useful! I am looking for examples and syntax as simple as how to convert a Dataiku dataset to a Spark dataframe!…
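For reference, the conversion asked about is a couple of lines with the dataiku.spark helpers; a minimal sketch (the dataset names are placeholders):

```python
# Sketch: load a DSS dataset as a Spark dataframe inside a PySpark recipe,
# and write one back. Dataset names are placeholders.
import dataiku
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

df = dkuspark.get_dataframe(sqlContext, dataiku.Dataset("my_input_dataset"))

# ... transform df with the usual Spark dataframe API ...

dkuspark.write_with_schema(dataiku.Dataset("my_output_dataset"), df)
```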
-
Control write partitioning with Spark
There does not appear to be a way to write Spark output to disk using a set partition scheme. This is normally done via dataframe.write.parquet(<path>, partitionBy=['year']) if one wants to partition the data by year, for example. I am looking at the API page here: https://doc.dataiku.com/dss/latest/python-api/pyspark.html,…
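One workaround, at the cost of bypassing the managed-dataset writer, is to use Spark's own writer against a path you control; a sketch with a hypothetical output path:

```python
# Sketch: write with an explicit partition scheme using the plain Spark API.
# The output path is an assumption, and files written this way are not
# tracked by DSS as a managed dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(2020, "a"), (2021, "b")], ["year", "value"])

(
    df.write
    .mode("overwrite")
    .partitionBy("year")                    # one subdirectory per year value
    .parquet("hdfs:///tmp/example_output")  # hypothetical path
)
```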
-
PyCharm API not supporting PySpark recipes
Hi, the API you recently created and detailed here: https://academy.dataiku.com/latest/tutorial/code/pycharm.html does not appear to support any recipe type other than Python. If I connect per the instructions, I can only see recipes that were written in Python. Since there is no real difference between PySpark and Python…
-
Spark pipeline merge rules
What kinds of visual recipes can be merged together during job execution?
-
Unable to write spark df to csv with column headers and multiple partitions?
Writing a Spark dataframe to CSV with headers repartitions the dataframe into 1 partition by default, so writing takes a lot of time on a large dataset because only one partition is active. How do I write a Spark dataframe to CSV on HDFS with column headers and multiple partitions, so that it runs faster?
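For reference, Spark's CSV writer does not require a single partition to emit headers: with header=true each part file gets its own header row, so coalescing to one partition is only needed when exactly one output file is required. A sketch with a hypothetical path:

```python
# Sketch: write CSV with headers while keeping multiple partitions, so all
# partitions write in parallel; each part-*.csv file gets its own header row.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1_000_000).withColumnRenamed("id", "value")

(
    df.write
    .mode("overwrite")
    .option("header", "true")
    .csv("hdfs:///tmp/example_csv")  # hypothetical path; yields many files
)
```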
-
WebApps with PySpark at backend
From this great tutorial on building webapps http://learn.dataiku.com/howto/code/webapps/use-python-backend.html I can see that I can use Python in the backend for larger volumes of data. Does this extend to PySpark?