API score Python Code optimization
Hello everyone,
Hope you're doing well!
I am trying to optimize the following Python code, which takes more than 6 h 40 min to score 4.5M rows:
# Compute recipe outputs from inputs
input_json = json.loads(input_df.fillna("").to_json(orient='index'))
row = 0
data = []
while row < input_df.shape[0]:
    prediction = client.predict_record("random_forest", input_json[str(row)])
    prediction["result"]['Id_client'] = input_json[str(row)]['Id_client']
    prediction["result"]['proba_1'] = prediction["result"]['probas']['1']
    row += 1
    data.append(prediction["result"])
application_scored_df = pd.DataFrame(data)
# Write recipe outputs
application_score = dataiku.Dataset("APPLICATION_SCORE")
application_score.write_with_schema(application_scored_df)
I wrote this code, which uses the function "predict_records" instead of "predict_record":
# predict_records takes as a parameter a list of dictionaries, each keyed by 'features': [{'features': {...}}, {'features': {}}, ...]
l = []
s = ['features']
for i in range(0, input_df.shape[0]):
    # Re-index the single-row frame with the label 'features', so that
    # to_dict('index') yields {'features': {column: value, ...}}
    l0 = input_df.iloc[[i,]].set_index([s])
    dicti = l0.to_dict('index')
    l.append(dicti)
prediction_test = client.predict_records("random_forest", l)
prediction_df = pd.DataFrame(prediction_test['results'])
prediction_df.insert(0, 'Id_client', input_df['Id_client'])
prediction_df['proba_1'] = prediction_df['probas'].apply(pd.Series)[['1']]
application_score = dataiku.Dataset("APPLICATION_SCORE")
application_score.write_with_schema(prediction_df)
The code works very well on 5 rows, but once applied to the full dataset, I get this error:
Do you know what this error means and how I can solve it to reduce the execution time?
Thanks in advance,
Kenza
Answers
-
Hi,
For your actual error, could you please attach the full log (Actions > View full job log)?
Since your code is running within DSS, do you really need to call an API node? You could also use a scoring recipe locally, which would be much faster.
-
To use the scoring recipe, I have to be in the same flow as the one where the model was implemented, or am I wrong?
In my case, the model is in the development flow and I want to use it in another flow (the production one), which is why I am using the API node.
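If the API node has to stay, one option worth trying is to send the rows to "predict_records" in fixed-size batches rather than one row at a time or all 4.5M rows in a single call. The sketch below is only an illustration under assumptions: the helper names (build_records, predict_in_batches, score_dataframe) and the batch size of 1000 are hypothetical, the client is assumed to expose predict_records(endpoint_id, records) returning a dict with a 'results' list as in the code above, and since the error itself is not shown in the thread, there is no guarantee that batching addresses it.

```python
import pandas as pd


def build_records(df):
    """Turn each row into {'features': {...}}, the shape used in the question."""
    # to_dict("records") builds all row dictionaries in one call,
    # avoiding a per-row iloc/set_index round trip.
    return [{"features": row} for row in df.to_dict("records")]


def predict_in_batches(client, endpoint_id, records, batch_size=1000):
    """Call client.predict_records once per batch and concatenate the results.

    batch_size is a hypothetical tuning knob, not a documented limit.
    """
    results = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        response = client.predict_records(endpoint_id, batch)
        results.extend(response["results"])
    return results


def score_dataframe(client, endpoint_id, df, id_column="Id_client", batch_size=1000):
    """Score df through the client and return one output row per input row."""
    records = build_records(df)
    results = predict_in_batches(client, endpoint_id, records, batch_size)
    scored = pd.DataFrame(results)
    # Insert the ids by position so that a non-default index on df
    # cannot misalign them with the predictions.
    scored.insert(0, id_column, df[id_column].to_numpy())
    # Assumes every result carries a 'probas' dict with a '1' key.
    scored["proba_1"] = [r["probas"]["1"] for r in results]
    return scored
```

With the names from the code above, this would be called as score_dataframe(client, "random_forest", input_df), and the returned DataFrame written with write_with_schema as before. Smaller batches mean more round trips but smaller requests, so the batch size is something to adjust by experiment.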