Failed to train : <class 'UnicodeEncodeError'> : charmap

Pranay
Level 2

Hello, I am using the free trial version of Dataiku DSS 12.1.2 on localhost to build a recommendation system. While training the model, I get the following error: "Failed to train : <class 'UnicodeEncodeError'> : charmap". The log snippet is shown below. Can someone please help me solve this issue? I have tried updating the packages and rebuilding the env, but that didn't work either.

Logs: 

[2023/08/04-13:12:38.670] [MRT-1706] [INFO] [dku.kernels]  - Process was cleaned up by monitoring thread
[2023/08/04-13:12:38.673] [MRT-1706] [INFO] [dku.kernels]  - Trying to enrich exception: com.dataiku.dip.io.SocketBlockLinkKernelException: Failed to train : <class 'UnicodeEncodeError'> : charmap from kernel com.dataiku.dip.analysis.coreservices.AnalysisMLKernel@2855092d retcode=0
[2023/08/04-13:12:38.676] [MRT-1706] [WARN] [dku.analysis.ml.python]  - Training failed
com.dataiku.dip.io.SocketBlockLinkKernelException: Failed to train : <class 'UnicodeEncodeError'> : charmap
	at com.dataiku.dip.io.SocketBlockLinkInteraction.throwExceptionFromPython(SocketBlockLinkInteraction.java:302)
	at com.dataiku.dip.io.SocketBlockLinkInteraction$AsyncResult.checkException(SocketBlockLinkInteraction.java:215)
	at com.dataiku.dip.io.SocketBlockLinkInteraction$AsyncResult.get(SocketBlockLinkInteraction.java:190)
	at com.dataiku.dip.io.SingleCommandKernelLink$1.call(SingleCommandKernelLink.java:211)
	at com.dataiku.dip.analysis.ml.prediction.PredictionTrainAdditionalThread.process(PredictionTrainAdditionalThread.java:76)
	at com.dataiku.dip.analysis.ml.shared.PRNSTrainThread.run(PRNSTrainThread.java:170)
[2023/08/04-13:12:38.679] [MRT-1706] [INFO] [dku.block.link]  - Closed socket
[2023/08/04-13:12:38.681] [MRT-1706] [INFO] [dku.block.link]  - Closed socket
[2023/08/04-13:12:38.684] [MRT-1706] [INFO] [dku.block.link]  - Closed serverSocket
[2023/08/04-13:12:38.686] [MRT-1706] [ERROR] [dku.analysis.ml.python]  - Processing failed
com.dataiku.dip.io.SocketBlockLinkKernelException: Failed to train : <class 'UnicodeEncodeError'> : charmap
	at com.dataiku.dip.io.SocketBlockLinkInteraction.throwExceptionFromPython(SocketBlockLinkInteraction.java:302)
	at com.dataiku.dip.io.SocketBlockLinkInteraction$AsyncResult.checkException(SocketBlockLinkInteraction.java:215)
	at com.dataiku.dip.io.SocketBlockLinkInteraction$AsyncResult.get(SocketBlockLinkInteraction.java:190)
	at com.dataiku.dip.io.SingleCommandKernelLink$1.call(SingleCommandKernelLink.java:211)
	at com.dataiku.dip.analysis.ml.prediction.PredictionTrainAdditionalThread.process(PredictionTrainAdditionalThread.java:76)
	at com.dataiku.dip.analysis.ml.shared.PRNSTrainThread.run(PRNSTrainThread.java:170)
[2023/08/04-13:12:38.690] [MRT-1706] [INFO] [dku.analysis.ml]  - Locking model train info file C:\Users\t0278540\AppData\Local\Dataiku\DataScienceStudio\dss_home\analysis-data\RECOMMENDATION_ATTEMPT1\mi8OCo9P\mgjMlxna\sessions\s2\pp1\m1\train_info.json
[2023/08/04-13:12:38.699] [MRT-1706] [INFO] [dku.analysis.ml]  - Unlocking model train info file C:\Users\t0278540\AppData\Local\Dataiku\DataScienceStudio\dss_home\analysis-data\RECOMMENDATION_ATTEMPT1\mi8OCo9P\mgjMlxna\sessions\s2\pp1\m1\train_info.json
[2023/08/04-13:12:38.702] [FT-TrainWorkThread-BWE8Tw2G-1704] [INFO] [dku.analysis.ml.python] T-mgjMlxna - [ct: 84515] Processing thread joined ...
[2023/08/04-13:12:38.706] [FT-TrainWorkThread-BWE8Tw2G-1704] [INFO] [dku.analysis] T-mgjMlxna - [ct: 84519] Train done

Operating system used: Windows (10 Enterprise)

 

JordanB
Dataiker

Hi @Pranay,

It appears that there is a character encoding issue. Please check your dataset for non-ASCII characters. One way to remove them is to use a Prepare recipe with the "Simplify text" or "Transform string" processor.
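To locate the offending values before cleaning them, a small script along these lines may help. It is only a sketch: `find_non_ascii`, `fails_in_cp1252`, and `sample_rows` are hypothetical names made up for illustration, and it assumes the rows are available as plain Python dicts. It also assumes the failure comes from a Windows code page such as cp1252, whose codecs report themselves as "charmap" in the error message; the logs above do not confirm which encoding was involved.

```python
def find_non_ascii(rows):
    """Return (row_index, column, characters) for each cell with non-ASCII text."""
    hits = []
    for i, row in enumerate(rows):
        for column, value in row.items():
            if not isinstance(value, str):
                continue  # only text cells can carry encoding problems
            bad = sorted({ch for ch in value if ord(ch) > 127})
            if bad:
                hits.append((i, column, bad))
    return hits


def fails_in_cp1252(text):
    """True if the text cannot be encoded with the Windows cp1252 code page."""
    try:
        text.encode("cp1252")
        return False
    except UnicodeEncodeError:
        return True


# Hypothetical sample data: "\u00e9" is e-acute, "\u2615" is a hot-beverage symbol.
sample_rows = [
    {"user": "alice", "item": "Cafe"},
    {"user": "bob", "item": "Caf\u00e9 \u2615"},
]

for row_index, column, chars in find_non_ascii(sample_rows):
    # ascii() keeps the output printable even on a cp1252 console
    print(row_index, column, ascii(chars))
```

Note that not every non-ASCII character breaks cp1252: an accented letter such as "\u00e9" encodes fine, while a symbol such as "\u2615" does not. Checking with `fails_in_cp1252` can narrow down which values actually need cleaning, if cp1252 is indeed the encoding in play.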

Let me know if you have any questions. 

Thanks!

Jordan

Pranay
Level 2
Author

Thanks for the solution. For the short term, I manually removed the non-ASCII characters from my small dataset, but the long-term solution of using "Transform string" works perfectly!
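For anyone who prefers to do the same cleanup in code (for example in a Python recipe) instead of removing characters by hand, here is a minimal sketch using only the standard library. `to_ascii` is a hypothetical helper name, and this is not how the DSS processors are implemented; it simply decomposes accented letters and drops whatever cannot be represented in ASCII.

```python
import unicodedata


def to_ascii(text):
    """Best-effort ASCII version of text: accents are stripped, other symbols dropped."""
    # NFKD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" discards everything outside ASCII
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Caf\u00e9"))   # "\u00e9" is e-acute
print(to_ascii("na\u00efve"))  # "\u00ef" is i-diaeresis
```

This is lossy: characters with no ASCII base letter (emoji, most non-Latin scripts) disappear entirely, so it only suits columns where that loss is acceptable.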
