
Plugin isn't shown anymore but still exists


Hi, 

I was creating a plugin to correct a weird behavior of one of our engines: while syncing a Teradata table to HDFS using the TDCH engine, dates (in "%Y-%m-%D" format) are typed as dates in the schema even though they are actually strings.

I created a plugin recipe that copies the synced HDFS dataset to another one using Spark, then modifies the schema of the new dataset through the API in order to set the date columns back to string (the real format of our data).

The plugin didn't work, for multiple reasons:

  • First, the pyspark module wasn't available (Error in Python process: At line 4: <type 'exceptions.ImportError'>: No module named pyspark), so we created a code environment dedicated to the plugin to fix it;
  • From there, pyspark was correctly imported and the session created. But when we tried to retrieve the dataframe, we got an error (Error in Python process: At line 23: <class 'py4j.protocol.Py4JJavaError'>: An error occurred while calling o23.classForName).
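A guard like the following (my own sketch, not part of the plugin) would have turned the line-4 crash into an explicit message pointing at the code environment:

```python
# Sketch (assumption): fail fast with a clear message when pyspark
# is missing from the recipe's code environment.
try:
    import pyspark  # noqa: F401 -- provided by the plugin's dedicated code env
    PYSPARK_AVAILABLE = True
except ImportError:
    PYSPARK_AVAILABLE = False

if not PYSPARK_AVAILABLE:
    print("pyspark is not importable: check the plugin's code environment")
```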

So I thought it might be because I had specified the pyspark version (2.4.0) both in the root folder and in code-env/specs/python/requirements.txt.

I then commented out all the lines in my "requirements.json" to check its effect, since it was the first time I was using it, and that is when the bug happened:

  1. I could not access the plugin anymore (plugin_exists_but_doesnt.PNG);
  2. I could not create another plugin with the same id (creating_plugin_same_id_fail.PNG);
  3. I asked the admin whether he could download the plugin from the machine itself, and he did.

In conclusion: the plugin still exists, with all its files, but cannot be accessed because the fully commented-out requirements.json is effectively empty.
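My reading of the bug (an assumption, not confirmed by Dataiku): `//` comments are not valid JSON, so a strict parser rejects the whole file and DSS can no longer load the plugin's metadata. This is easy to reproduce outside DSS:

```python
import json

# The requirements.json content with every line commented out, as in the plugin
commented = '\n'.join([
    '//{',
    '//  "python" : [',
    '//    {"name":"pyspark", "version":"==2.4.0"}',
    '//],',
    '//"R" : [',
    '//]',
    '//}',
])

try:
    json.loads(commented)
    parse_ok = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    parse_ok = False

print(parse_ok)  # False: the commented-out file is not parseable JSON
```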

I tried to access it by modifying the URL, but the editor isn't available.

Below are the JSON file and the plugin recipe.

requirements.json : 

//{
  //  "python" : [
    //    {"name":"pyspark", "version":"==2.4.0"}
    //],
    //"R" : [
    //]
//}

 

recipe.py : 

# -*- coding: utf-8 -*-

import dataiku
from dataiku import spark as dkuspark
from dataiku.customrecipe import get_input_names_for_role, get_output_names_for_role
from pyspark import SparkContext
from pyspark.sql import SQLContext

### Start our Spark session
print('creating session')
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
print('session created')

### Get our input dataset (dataset + dataframe) and the output dataset
# input
to_correct = get_input_names_for_role('to_correct')[0] # the dataset that will be corrected

dataset_to_correct = dataiku.Dataset(to_correct) # fetch the input dataset quickly through Spark
print('getting dataframe')
df = dkuspark.get_dataframe(sqlContext, dataset_to_correct) # the Spark dataframe
print('df get')

# output
output_name = get_output_names_for_role('main_output')[0].split('.')[-1] # our output dataset
dataset_out = dataiku.Dataset(output_name)
dkuspark.write_with_schema(dataset_out, df) # write the dataframe (identical copy)
# measured: 1.33 min to launch Spark + write a df[100000,804]

### Modify the dataset so that the date variables become strings
# init
client = dataiku.api_client()
projectkey = dataiku.get_custom_variables().get('projectKey')
project = client.get_project(projectkey)
dataset = project.get_dataset(output_name) # careful: not the same dataset type as above

# Get our schema
schema = dataset.get_schema() # kept around in case of error
new_schema = dataset.get_schema() # the one we will modify
for i in range(len(new_schema['columns'])): # for each column of our schema
    column = new_schema['columns'][i] # the current column
    if column['type'] == u'date': # if it was originally a date
        new_schema['columns'][i]['type'] = u'string' # turn the date into a string so that Dataiku displays it in the Explore view

# Once all the modifications are done, save them in the dataset
result = dataset.set_schema(new_schema)

# a dictionary holding the results of our modifications
resultat = dict()
resultat['dataset_name'] = output_name # the name of the modified dataset
resultat['result'] = result # did the processing work correctly?
resultat['old_schema'] = schema # the previous schema of our dataset

if True in (result['error'], result['fatal']): # if we got an error
    print("Schema modification failed. Reverting to the original state.")
    resultat['new_schema'] = schema
    resultat['presence_erreur'] = True
    dataset.set_schema(schema) # restore the original schema
else: # if everything went fine
    resultat['new_schema'] = new_schema
    resultat['presence_erreur'] = False

print("Here are the modifications made:\n%s" % resultat)
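The date-to-string loop above is pure dictionary manipulation, so it can be isolated and tested without a DSS instance. A minimal sketch (the function name and sample schema are mine):

```python
import copy

def dates_to_strings(schema):
    """Return a copy of a DSS-style schema dict with 'date' columns retyped as 'string'."""
    new_schema = copy.deepcopy(schema)
    for column in new_schema['columns']:
        if column['type'] == 'date':
            column['type'] = 'string'
    return new_schema

schema = {'columns': [{'name': 'id', 'type': 'bigint'},
                      {'name': 'created', 'type': 'date'}]}
fixed = dates_to_strings(schema)
print(fixed['columns'][1]['type'])   # string
print(schema['columns'][1]['type'])  # date (the original is untouched)
```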

 

Greetings

1 Reply
Author

I just wanted to inform you about this bug. 

To solve it: either delete the plugin folder from the machine, or modify the requirements.json from the command line (un-commenting the corresponding lines).
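For reference, the un-commented version of the file, which parses as valid JSON again:

```json
{
  "python" : [
    {"name":"pyspark", "version":"==2.4.0"}
  ],
  "R" : [
  ]
}
```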
