Error using vectorization on N-grams to perform NLP

frenchboss · May 2019

I have a column where every entry is a user problem, and I would like to analyse it with NLP machine learning. When I train the model, I keep getting this error:

Failed to train : <type 'exceptions.IOError'> : [Errno 2] No such file or directory: u'/apps/hadoop/data01/dataiku/data_dir/analysis-data/EUROPA/FQ9JmSC7/cnt5kb4d/sessions/s9/pp1/countvec_Customer Verbatim / Issue Detail.pkl'

Here are the logs (I couldn't copy the full thing):




[2019-05-22 11:16:13,333] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:RemapValueToOutput
[2019-05-22 11:16:13,346] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:MultipleImputeMissingFromInput
[2019-05-22 11:16:13,346] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] MIMIFI: Imputing with map {}
[2019-05-22 11:16:13,346] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:FlushDFBuilder(num_flagonly)
[2019-05-22 11:16:13,347] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:FastSparseDummifyProcessor (Category)
[2019-05-22 11:16:13,364] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] Dummifier: Append a sparse block shape=(56096, 16) nnz=56095
[2019-05-22 11:16:13,365] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:FastSparseDummifyProcessor (Customer Type)
[2019-05-22 11:16:13,383] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] Dummifier: Append a sparse block shape=(56096, 4) nnz=55798
[2019-05-22 11:16:13,384] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:FastSparseDummifyProcessor (Potential Regulatory Theme)
[2019-05-22 11:16:13,402] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] Dummifier: Append a sparse block shape=(56096, 102) nnz=55682
[2019-05-22 11:16:13,402] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:FastSparseDummifyProcessor (Method of Contact)
[2019-05-22 11:16:13,420] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] Dummifier: Append a sparse block shape=(56096, 11) nnz=56095
[2019-05-22 11:16:13,421] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:MultipleImputeMissingFromInput
[2019-05-22 11:16:13,421] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] MIMIFI: Imputing with map {}
[2019-05-22 11:16:13,421] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:FlushDFBuilder(cat_flagpresence)
[2019-05-22 11:16:13,421] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH <class 'dataiku.doctor.preprocessing.dataframe_preprocessing.TextCountVectorizerProcessor'> (Customer Verbatim / Issue Detail)
[2019-05-22 11:16:13,423] [51597/MainThread] [INFO] [root] Using vectorizer: CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=0.7, max_features=None, min_df=0.001,
        ngram_range=(3, 1), preprocessor=None,
        stop_words=['m', 's', 'r', 've', 'd', 'tt', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', '...e', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too'],
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
[2019-05-22 11:16:15,064] [51597/MainThread] [INFO] [root] Produced a matrix of size (56096, 1778)
[2019-05-22 11:16:15,070] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:MultipleImputeMissingFromInput
[2019-05-22 11:16:15,070] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] MIMIFI: Imputing with map {}
[2019-05-22 11:16:15,070] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:FlushDFBuilder(interaction)
[2019-05-22 11:16:15,070] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:RealignTarget
[2019-05-22 11:16:15,070] [51597/MainThread] [INFO] [root] Realign target series = (56096,)
[2019-05-22 11:16:15,074] [51597/MainThread] [INFO] [root] After realign target: (56096,)
[2019-05-22 11:16:15,074] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:DropRowsWhereNoTarget
[2019-05-22 11:16:15,075] [51597/MainThread] [INFO] [root] Deleting 0 rows because no target
[2019-05-22 11:16:15,075] [51597/MainThread] [INFO] [root] MF before = (56096, 1911) target before = (56096,)
[2019-05-22 11:16:15,080] [51597/MainThread] [INFO] [root] MultiFrame, dropping rows: []
[2019-05-22 11:16:15,129] [51597/MainThread] [INFO] [root] After DRWNT input_df=(56096, 21)
[2019-05-22 11:16:15,129] [51597/MainThread] [INFO] [root] MF after = (56096, 1911) target after = (56096,)
[2019-05-22 11:16:15,129] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:DumpPipelineState
[2019-05-22 11:16:15,129] [51597/MainThread] [INFO] [root] ********* Pipieline state (Before feature selection)
[2019-05-22 11:16:15,129] [51597/MainThread] [INFO] [root]    input_df= (56096, 21) 
[2019-05-22 11:16:15,129] [51597/MainThread] [INFO] [root]    current_mf=(56096, 1911) 
[2019-05-22 11:16:15,129] [51597/MainThread] [INFO] [root]    PPR: 
[2019-05-22 11:16:15,129] [51597/MainThread] [INFO] [root]       target = <class 'pandas.core.series.Series'> ((56096,))
[2019-05-22 11:16:15,129] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:EmitCurrentMFAsResult
[2019-05-22 11:16:15,130] [51597/MainThread] [INFO] [root] Set MF index len 56096
[2019-05-22 11:16:15,130] [51597/MainThread] [DEBUG] [dku.ml.preprocessing] FIT/PROCESS WITH Step:DumpPipelineState
[2019-05-22 11:16:15,130] [51597/MainThread] [INFO] [root] ********* Pipieline state (At end)
[2019-05-22 11:16:15,130] [51597/MainThread] [INFO] [root]    input_df= (56096, 21) 
[2019-05-22 11:16:15,130] [51597/MainThread] [INFO] [root]    current_mf=(0, 0) 
[2019-05-22 11:16:15,130] [51597/MainThread] [INFO] [root]    PPR: 
[2019-05-22 11:16:15,130] [51597/MainThread] [INFO] [root]       UNPROCESSED = <class 'pandas.core.frame.DataFrame'> ((56096, 21))
[2019-05-22 11:16:15,130] [51597/MainThread] [INFO] [root]       TRAIN = <class 'dataiku.doctor.multiframe.MultiFrame'> ((56096, 1911))
[2019-05-22 11:16:15,130] [51597/MainThread] [INFO] [root]       target = <class 'pandas.core.series.Series'> ((56096,))
[2019-05-22 11:16:15,131] [51597/MainThread] [INFO] [root] END -  Preprocessing train set
Traceback (most recent call last):
  File "/apps/hadoop/data01/dataiku/dataiku-dss-5.0.2/python/dataiku/doctor/server.py", line 47, in serve
    ret = api_command(arg)
  File "/apps/hadoop/data01/dataiku/dataiku-dss-5.0.2/python/dataiku/doctor/dkuapi.py", line 45, in aux
    return api(**kwargs)
  File "/apps/hadoop/data01/dataiku/dataiku-dss-5.0.2/python/dataiku/doctor/commands.py", line 259, in train_prediction_models_nosave
    preproc_handler.save_data()
  File "/apps/hadoop/data01/dataiku/dataiku-dss-5.0.2/python/dataiku/doctor/preprocessing_handler.py", line 165, in save_data
    self._save_resource(resource_name)
  File "/apps/hadoop/data01/dataiku/dataiku-dss-5.0.2/python/dataiku/doctor/preprocessing_handler.py", line 104, in _save_resource
    with open(self._resource_filepath(resource_name, type), "wb") as resource_file:
IOError: [Errno 2] No such file or directory: u'/apps/hadoop/data01/dataiku/data_dir/analysis-data/EUROPA/FQ9JmSC7/cnt5kb4d/sessions/s10/pp1/countvec_Customer Verbatim / Issue Detail.pkl'
[2019/05/22-11:16:15.137] [MRT-2523415] [INFO] [dku.block.link.interaction]  - Check result for nullity exceptionIfNull=true result=null
[2019/05/22-11:16:15.365] [wrapper-stderr-2523439] [INFO] [dku.utils]  - 2019-05-22 11:16:15,358 51573 INFO [Child] Process 51597 exited with exit=0 signal=0
[2019/05/22-11:16:15.365] [wrapper-stderr-2523439] [INFO] [dku.utils]  - 2019-05-22 11:16:15,359 51573 INFO Full child code: 0
[2019/05/22-11:16:15.382] [KNL-python-single-command-kernel-monitor-2523444] [INFO] [dku.kernels]  - Process done with code 0
[2019/05/22-11:16:15.383] [KNL-python-single-command-kernel-monitor-2523444] [INFO] [dip.tickets]  - Destroying API ticket for analysis-ml-EUROPA-Wx3BdSd on behalf of gpaille
[2019/05/22-11:16:15.383] [MRT-2523415] [INFO] [dku.kernels]  - Getting kernel tail
[2019/05/22-11:16:15.425] [MRT-2523415] [INFO] [dku.kernels]  - Trying to enrich exception: com.dataiku.dip.io.SocketBlockLinkKernelException: Failed to train : <type 'exceptions.IOError'> : [Errno 2] No such file or directory: u'/apps/hadoop/data01/dataiku/data_dir/analysis-data/EUROPA/FQ9JmSC7/cnt5kb4d/sessions/s10/pp1/countvec_Customer Verbatim / Issue Detail.pkl' from kernel com.dataiku.dip.analysis.coreservices.AnalysisMLKernel@704518a7 process=null pid=?? retcode=0
[2019/05/22-11:16:15.426] [MRT-2523415] [WARN] [dku.analysis.ml.python]  - Training failed
com.dataiku.dip.io.SocketBlockLinkKernelException: Failed to train : <type 'exceptions.IOError'> : [Errno 2] No such file or directory: u'/apps/hadoop/data01/dataiku/data_dir/analysis-data/EUROPA/FQ9JmSC7/cnt5kb4d/sessions/s10/pp1/countvec_Customer Verbatim / Issue Detail.pkl'
	at com.dataiku.dip.io.SocketBlockLinkInteraction.throwExceptionFromPython(SocketBlockLinkInteraction.java:298)
	at com.dataiku.dip.io.SocketBlockLinkInteraction$AsyncResult.checkException(SocketBlockLinkInteraction.java:215)
	at com.dataiku.dip.io.SocketBlockLinkInteraction$AsyncResult.get(SocketBlockLinkInteraction.java:190)
	at com.dataiku.dip.io.SingleCommandKernelLink$1.call(SingleCommandKernelLink.java:208)
	at com.dataiku.dip.analysis.ml.prediction.PredictionTrainAdditionalThread.process(PredictionTrainAdditionalThread.java:75)
	at com.dataiku.dip.analysis.ml.shared.PRNSTrainThread.run(PRNSTrainThread.java:130)
[2019/05/22-11:16:15.436] [FT-TrainWorkThread-9rsHhxAF-2523414] [INFO] [dku.analysis.ml.python] T-cnt5kb4d - Processing thread joined ...
[2019/05/22-11:16:15.436] [FT-TrainWorkThread-9rsHhxAF-2523414] [INFO] [dku.analysis.ml.python] T-cnt5kb4d - Joining processing thread ...
[2019/05/22-11:16:15.437] [FT-TrainWorkThread-9rsHhxAF-2523414] [INFO] [dku.analysis.ml.python] T-cnt5kb4d - Processing thread joined ...
[2019/05/22-11:16:15.437] [FT-TrainWorkThread-9rsHhxAF-2523414] [INFO] [dku.analysis.prediction] T-cnt5kb4d - Train done
[2019/05/22-11:16:15.437] [FT-TrainWorkThread-9rsHhxAF-2523414] [INFO] [dku.analysis.prediction] T-cnt5kb4d - Train done
[2019/05/22-11:16:15.442] [FT-TrainWorkThread-9rsHhxAF-2523414] [INFO] [dku.analysis.prediction] T-cnt5kb4d - Publishing mltask-train-done reflected event

Nicolas_Servel · May 2019

Hello,

For us to investigate your issue, could you please provide us with:

- a screenshot of the Feature handling that you are using for this particular column

- the full log of the training (you can find it by clicking on "Logs" at the top right of your failed algorithm)

Thanks,
Regards,

Nicolas Servel

frenchboss · May 2019

Hey, I updated the question.

Nicolas_Servel · May 2019

Hello again,

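After investigation, the issue comes from the fact that your column name contains a "/" character. As the traceback shows, the column name is used in the name of the vectorizer file (countvec_<column name>.pkl), so the "/" is read as a directory separator and the resulting path does not exist.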

The workaround is to rename your column so that it no longer contains the special character "/". You can do this, for example, in the "Script" of your analysis: go to the "Script" tab of your analysis, click on the name of the column, then click on "Rename".
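To illustrate why the "/" matters, here is a minimal Python 3 sketch that reproduces the same kind of failure outside DSS. It is not the DSS code: `resource_path`, `safe_column_name` and `can_write` are hypothetical helpers, and the file name pattern is only copied from the path in the traceback. The underscore replacement is just one possible renaming convention.

```python
import os
import re
import tempfile


def resource_path(folder, column_name):
    # Hypothetical helper: mimics the "countvec_<column name>.pkl" file name
    # seen in the traceback. This is an assumption, not the actual DSS code.
    return os.path.join(folder, "countvec_" + column_name + ".pkl")


def safe_column_name(column_name):
    # Replace each path separator (and the spaces around it) with an underscore.
    return re.sub(r"\s*[/\\]\s*", "_", column_name)


def can_write(path):
    # Try to create the file and report whether the OS accepted the path.
    try:
        with open(path, "wb") as resource_file:
            resource_file.write(b"")
        return True
    except (IOError, OSError):
        return False


original = "Customer Verbatim / Issue Detail"
renamed = safe_column_name(original)

with tempfile.TemporaryDirectory() as folder:
    # "countvec_Customer Verbatim " is treated as a directory that does not exist.
    original_ok = can_write(resource_path(folder, original))
    # Without the "/", the file is created directly inside the folder.
    renamed_ok = can_write(resource_path(folder, renamed))

print(renamed)
print(original_ok)
print(renamed_ok)
```

With the original name, the write fails with the same "No such file or directory" error as in the log, while the renamed column produces a valid file name.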

We will work on fixing this issue for a future release of DSS.

Regards,

Nicolas Servel
