Plugin isn't shown anymore but still exists

sseveur
Level 3

Hi, 

I was creating a plugin in order to correct a weird behavior of one of our engines: while syncing a Teradata table to HDFS using the TDCH engine, dates (in a "%Y-%m-%D" format) are imported as dates in the schema even though they are actually strings.

I created a plugin recipe that copies the synced HDFS dataset to another one using Spark, then modifies the schema of the new dataset through the API in order to set the date columns back to string (which is the real format of our data).

The plugin didn't work, for multiple reasons:

  • First, the pyspark module wasn't available (Error in Python process: At line 4: <type 'exceptions.ImportError'>: No module named pyspark), so we created a code environment dedicated to the plugin in order to fix it;
  • From there, pyspark was correctly imported and the session was created, but when we tried to retrieve the dataframe we got an error (Error in Python process: At line 23: <class 'py4j.protocol.Py4JJavaError'>: An error occurred while calling o23.classForName) (see the version-check sketch after this list).
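
In case it helps anyone hitting the same error: one possible cause of this kind of Py4JJavaError (not confirmed in our case) is a mismatch between the pyspark package pinned in the code environment and the Spark version installed on the cluster. A minimal diagnostic sketch to compare the two, assuming the recipe at least manages to create a SparkContext:

# Diagnostic sketch (assumption: a SparkContext can be created).
# Compares the pyspark package shipped with the code env against the
# Spark version actually running on the cluster.
import pyspark
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
print('pyspark package version: %s' % pyspark.__version__)  # version pinned in the code env
print('cluster Spark version: %s' % sc.version)              # version reported by the running context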

So I thought it might be due to the fact that I had specified the pyspark version (2.4.0) both in the requirements.json at the plugin root and in code-env/specs/python/requirements.txt.

So I commented out all the lines in my "requirements.json" to check the effect of this file, since it was the first time I was using it, and that is when the bug happened:

  1. I could not access the plugin anymore (plugin_exists_but_doesnt.PNG);
  2. I could not create another plugin with the same id (creating_plugin_same_id_fail.PNG);
  3. I asked the admin to check whether he could download the plugin from the machine itself, and he did.

In conclusion: the plugin still exists with all its files, but it cannot be accessed because the requirements.json contains only commented-out lines (which makes it invalid JSON, since JSON does not support // comments).

I tried to access it by modifying the URL directly, but the editor isn't available.

Below are the requirements.json and the plugin recipe.

requirements.json : 

//{
  //  "python" : [
    //    {"name":"pyspark", "version":"==2.4.0"}
    //],
    //"R" : [
    //]
//}
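
For what it's worth, JSON has no // comment syntax, so a requirements.json commented out this way can no longer be parsed at all, which would explain why DSS stops listing the plugin. A minimal sketch reproducing the parse failure, using the exact file contents above and their uncommented equivalent:

# Minimal sketch: a requirements.json with // comments is not valid JSON.
import json

commented = """\
//{
  //  "python" : [
    //    {"name":"pyspark", "version":"==2.4.0"}
    //],
    //"R" : [
    //]
//}
"""

uncommented = """\
{
    "python" : [
        {"name":"pyspark", "version":"==2.4.0"}
    ],
    "R" : [
    ]
}
"""

try:
    json.loads(commented)
except ValueError as e:  # JSONDecodeError is a subclass of ValueError
    print("commented-out file fails to parse: %s" % e)

print("uncommented file parses to: %s" % json.loads(uncommented))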

 

recipe.py : 

# -*- coding: utf-8 -*-

from dataiku.customrecipe import *
import dataiku
import numpy as np
import pyspark
from dataiku import spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

### Start our Spark session
print('creating session')
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
print('session created')

### Get our input dataset (dataset + df) and the output dataset
# input
to_correct = get_input_names_for_role('to_correct')[0] # the dataset that will be corrected

dataset_to_correct = dataiku.Dataset(to_correct) # read the input dataset quickly thanks to Spark
print('getting dataframe')
df = dkuspark.get_dataframe(sqlContext, dataset_to_correct) # the Spark dataframe
print('df get')

# output
output_name = get_output_names_for_role('main_output')[0].split('.')[-1] # our output dataset
dataset_out = dataiku.Dataset(output_name)
dkuspark.write_with_schema(dataset_out, df) # write the dataframe (identical copy)
# measured: 1.33 min to start Spark + write a df[100000,804]

### Modify the dataset so that the date variables become strings
# init
client = dataiku.api_client()
projectkey = dataiku.get_custom_variables().get('projectKey')
project = client.get_project(projectkey)
dataset = project.get_dataset(output_name) # careful: not the same type of dataset object

# Get our schema
schema = dataset.get_schema() # kept aside in case we get an error
new_schema = dataset.get_schema() # the one we are going to modify
for i in range(len(new_schema['columns'])): # for each column of our schema
    column = new_schema['columns'][i] # the current column
    if column['type'] == u'date': # if it was originally a date
        new_schema['columns'][i]['type'] = u'string' # turn the date into a string so that Dataiku displays it in the Explore

# Once all the modifications are done, save them in the dataset
result = dataset.set_schema(new_schema)

# a dictionary holding the results of our modifications
resultat = dict()
resultat['dataset_name'] = output_name # the name of the modified dataset
resultat['result'] = result # whether the processing worked correctly
resultat['old_schema'] = schema # the previous schema of our dataset

if True in (result['error'], result['fatal']): # if we got an error
    print("The schema modification failed. Reverting to the original state")
    resultat['new_schema'] = schema
    resultat['presence_erreur'] = True
    dataset.set_schema(schema) # restore the original schema
else: # if everything went fine
    resultat['new_schema'] = new_schema
    resultat['presence_erreur'] = False

print("Here are the modifications that were made:\n%s" % resultat)

 

Regards,

1 Reply
sseveur
Level 3
Author

I just wanted to inform you about this bug. 

To solve it: either delete the plugin folder from the machine, or edit the requirements.json from the command line and un-comment the corresponding lines.
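
For reference, here is a minimal sketch of the second option. The path is an assumption (dev plugins usually live somewhere like DATA_DIR/plugins/dev/<plugin-id>/, so adapt it to your own install); it simply strips the leading // from each line and checks that the result is valid JSON before writing it back:

# Sketch of the command-line fix (path is an assumption, adapt it to your install).
import json

path = '/path/to/DATA_DIR/plugins/dev/my-plugin-id/requirements.json'

with open(path) as f:
    lines = f.readlines()

# Remove the leading // that was commenting out each line
fixed = ''.join(line.replace('//', '', 1) if line.lstrip().startswith('//') else line
                for line in lines)

json.loads(fixed)  # raises ValueError if the result is still not valid JSON

with open(path, 'w') as f:
    f.write(fixed)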
