Hello! I'm back with another question on the API 😉
Here's the thing: I am building an entire project through the Python API. First I build the datasets:
new_dataset = new_project.create_dataset(
    dataset_name=dataset_name,
    type="HDFS",
    params={
        'connection': connection,
        'path': "/" + dataset_name,
        'hiveDatabase': connection,
        'hiveTableName': dataset_name,
        'metastoreSynchronizationEnabled': True,
    },
    formatType='orcfile',
)
I also add a schema copied from datasets in another project. Everything works fine here.
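For reference, the schema step can be sketched roughly like this (a hedged sketch assuming the public dataikuapi client; `copy_schema` and all names are hypothetical, not part of the library):

```python
# Hypothetical helper: copy a dataset schema across projects.
# `client` is a dataikuapi.DSSClient; project keys and dataset
# names are placeholders.
def copy_schema(client, src_project_key, src_dataset,
                dst_project_key, dst_dataset):
    """Read the schema of one dataset and apply it to another."""
    src = client.get_project(src_project_key).get_dataset(src_dataset)
    dst = client.get_project(dst_project_key).get_dataset(dst_dataset)
    # get_schema()/set_schema() exchange the same structure:
    # {"columns": [{"name": ..., "type": ...}, ...]}
    schema = src.get_schema()
    dst.set_schema(schema)
    return schema
```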
Then I build the recipes. Since the code is really long I won't post it here, but here's the idea:
I choose a name and a recipe type, then use the CodeRecipeCreator. I select some input(s) and output(s), then build the recipe (recipe_builder_object.build()). As a last step I put some code in the definition-and-payload object.
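The steps above can be sketched roughly as follows (a hedged sketch against the public dataikuapi client's definition/payload API; method names may differ by DSS version, and `build_code_recipe`/`default_payload` are hypothetical helpers, not library functions):

```python
def build_code_recipe(project, recipe_name, recipe_type,
                      input_name, output_name, code):
    """Create a code recipe, wire one input/output, and set its code payload."""
    # Lazy import: this only works against a live DSS connection.
    from dataikuapi.dss.recipe import CodeRecipeCreator

    builder = CodeRecipeCreator(recipe_name, recipe_type, project)
    builder.with_input(input_name)
    builder.with_output(output_name)   # the output dataset must already exist
    recipe = builder.build()

    # Last step from the description: put the code in the
    # definition-and-payload object and save it back.
    definition = recipe.get_definition_and_payload()
    definition.set_payload(code)
    recipe.set_definition_and_payload(definition)
    return recipe

def default_payload(input_name, output_name):
    """Pure helper: a trivial pass-through payload for a python recipe."""
    return (
        "import dataiku\n"
        "df = dataiku.Dataset(%r).get_dataframe()\n"
        "dataiku.Dataset(%r).write_with_schema(df)\n"
    ) % (input_name, output_name)
```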
OK, everything worked fine! Python/R/Hive/Stack... recipes work and write the data to the right dataset.
Then I tried a Spark recipe (PySpark/SparkR). The recipe itself seems okay, but when I try to run it in my project, I get the following error:
[Exec-61] [INFO] [dku.utils] - : org.apache.hadoop.mapred.FileAlreadyExistsException: 'Some_path_to_hdfs/dataset_name' already exists.
I just realized it's not about the recipe type, but about how the output is written. If you write a pandas DataFrame with dataiku.Dataset("dataset_name").write_with_schema(dataset_schema), the Spark recipe works. But it fails if you use the Spark version instead:
import dataiku
import dataiku.spark as dkuspark

ds_fac_output = dataiku.Dataset(dataset_output_name)
dkuspark.write_with_schema(ds_fac_output, spark_df)
So I looked into the parameters of the dataset and of the recipe, and I can't figure out why this isn't working. As far as I know, I am working in 'erase' mode, not 'append' mode. If I manually delete the directory containing the ORC dataset on HDFS, the recipe works, but only ONCE: if I run it again, I get the same error.
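The erase-vs-append setting can be inspected on the recipe outputs roughly like this (a sketch assuming the raw recipe definition JSON carries an "outputs" role map with per-item "appendMode" flags; the exact layout may differ by DSS version, and `output_append_modes` is a hypothetical helper):

```python
def output_append_modes(raw_recipe_def):
    """Map each output dataset ref to its appendMode flag (False = overwrite)."""
    modes = {}
    # Assumed layout: {"outputs": {"main": {"items": [{"ref": ..., "appendMode": ...}]}}}
    for role in raw_recipe_def.get("outputs", {}).values():
        for item in role.get("items", []):
            modes[item["ref"]] = item.get("appendMode", False)
    return modes
```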
Hoping I didn't omit anything! 🙂
Thanks