Plugin isn't shown anymore but still exists
Hi,
I was creating a plugin to work around a weird behavior of one of our engines: when syncing a Teradata table to HDFS with the TDCH engine, dates (in "%Y-%m-%D" format) are imported as dates in the schema even though they are actually strings.
I created a plugin recipe that copies the synced HDFS dataset to another one using Spark, then modifies the schema of the new dataset through the API to set the date columns back to string (which is the real type of our data).
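In short, the schema fix boils down to something like the following condensed sketch (the full recipe is posted below; the dataset name used here is just a placeholder):

import dataiku

# Condensed sketch of the schema fix; 'my_output_dataset' is a placeholder name.
output_name = 'my_output_dataset'
client = dataiku.api_client()
project = client.get_project(dataiku.get_custom_variables().get('projectKey'))
dataset = project.get_dataset(output_name)

new_schema = dataset.get_schema()
for column in new_schema['columns']:
    if column['type'] == u'date':   # columns wrongly imported as dates
        column['type'] = u'string'  # the underlying data really is a string
dataset.set_schema(new_schema)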
The plugin didn't work for several reasons:
- First, the pyspark module wasn't available (Error in Python process: At line 4: <type 'exceptions.ImportError'>: No module named pyspark), so we created a code environment dedicated to the plugin to fix it;
- From there, pyspark was correctly imported and the session was created, but when we tried to retrieve the dataframe we got an error (Error in Python process: At line 23: <class 'py4j.protocol.Py4JJavaError'>: An error occurred while calling o23.classForName).
So I thought it might be because I had specified the pyspark version (2.4.0) both in the root folder and in code-env/specs/python/requirements.txt.
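(For reference, the code env requirement was just the pinned pyspark version, i.e. something like this single line in code-env/specs/python/requirements.txt:)

pyspark==2.4.0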
So I commented out all the lines in my "requirements.json" to check its effect, since it was the first time I was using this file, and that is when the bug happened:
- I can no longer access the plugin (plugin_exists_but_doesnt.PNG);
- I cannot create another plugin with the same id (creating_plugin_same_id_fail.PNG);
- I asked the admin to check whether he could download the plugin from the machine itself, and he could.
In conclusion: the plugin still exists with all its files, but it cannot be accessed because its requirements.json is entirely commented out, and since JSON has no comment syntax the file is effectively empty/invalid.
I tried to access it by editing the URL directly, but the editor isn't available.
Below are the requirements.json and the plugin recipe.
requirements.json:
//{
//    "python" : [
//        {"name":"pyspark", "version":"==2.4.0"}
//    ],
//    "R" : [
//    ]
//}
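Presumably DSS can no longer parse the file, since "//" comments are not valid JSON; a quick check outside DSS (a sketch, not the actual DSS code) shows the parse error:

import json

commented = '//{\n//    "python" : []\n//}'
try:
    json.loads(commented)
except ValueError as e:  # json.JSONDecodeError in recent Python versions
    print("invalid JSON: %s" % e)  # "//" comments are not valid JSON, so parsing fails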
recipe.py:
# -*- coding: utf-8 -*-
from dataiku.customrecipe import *
import dataiku
import numpy as np
import pyspark
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

### Start our Spark session
print('creating session')
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
print('session created')

### Get our input dataset (dataset + df) and the output dataset
# input
to_correct = get_input_names_for_role('to_correct')[0]  # the dataset that will be corrected
dataset_to_correct = dataiku.Dataset(to_correct)

# Read the input dataset quickly thanks to Spark
print('getting dataframe')
df = dkuspark.get_dataframe(sqlContext, dataset_to_correct)  # the Spark dataframe
print('df get')

# output
output_name = get_output_names_for_role('main_output')[0].split('.')[-1]  # our output dataset
dataset_out = dataiku.Dataset(output_name)
dkuspark.write_with_schema(dataset_out, df)  # write the dataframe (identical copy)
# measured: 1.33 min to start Spark + write a df[100000, 804]

### Modify the dataset so that the date variables become strings
# init
client = dataiku.api_client()
projectkey = dataiku.get_custom_variables().get('projectKey')
project = client.get_project(projectkey)
dataset = project.get_dataset(output_name)  # careful: not the same kind of dataset object

# Get our schema
schema = dataset.get_schema()      # kept aside in case of error
new_schema = dataset.get_schema()  # the one we will modify

for i in range(len(new_schema['columns'])):  # for each column of our schema
    column = new_schema['columns'][i]  # current column
    if column['type'] == u'date':  # if it was imported as a date
        new_schema['columns'][i]['type'] = u'string'  # turn the date into a string so Dataiku displays it as such in the explore

# Once all modifications are done, save them in the dataset
result = dataset.set_schema(new_schema)

# a dictionary with the results of our modifications
resultat = dict()
resultat['dataset_name'] = output_name  # name of the modified dataset
resultat['result'] = result  # did the processing work correctly
resultat['old_schema'] = schema  # the previous schema of the dataset

if True in (result['error'], result['fatal']):  # if there was an error
    print("Schema modification failed. Reverting to the original state")
    resultat['new_schema'] = schema
    resultat['presence_erreur'] = True
    dataset.set_schema(schema)  # restore the original schema
else:  # if everything went fine
    resultat['new_schema'] = new_schema
    resultat['presence_erreur'] = False

print("Here are the modifications performed:\n%s" % resultat)
Regards
Answers
I just wanted to inform you about this bug.
To solve it: either delete the plugin folder from the machine, or use the command line to modify the requirements.json and un-comment the corresponding lines.
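After un-commenting, the requirements.json should look like this again:

{
    "python" : [
        {"name":"pyspark", "version":"==2.4.0"}
    ],
    "R" : [
    ]
}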