Python process failed during building knowledge bank

Ajoshi005
Level 1

I am trying to build a knowledge bank from PDF documents (about 2 pages) using the OpenAI embedding model ADA-002. The visual text splitter recipe runs successfully, but when I run the embedding recipe (build knowledge bank) I get this error message:

 Oops: an unexpected error occurred

The Python process failed (exit code: 1). More info might be available in the logs.

Logs:

[11:37:43] [INFO] [dip.exec.resultHandler] - Did not find a specific error from error files or logs, falling back on return code
[11:37:43] [INFO] [dku.ml.distributed.pool] - Closing worker pool pool-pge76po0rmu1ajbt
[11:37:43] [INFO] [dku.ml.distributed.service] - Unregistered worker pool: pool-pge76po0rmu1ajbt
[11:37:43] [INFO] [dku.flow.activity] - Run thread failed for activity compute_Extracted_text_RBC_HSBC_embedded_1_NP
com.dataiku.dip.exceptions.ProcessDiedException: The Python process failed (exit code: 1). More info might be available in the logs.
	at com.dataiku.dip.dataflow.common.CodeBasedThingHelper.throwSubprocessError(CodeBasedThingHelper.java:23)
	at com.dataiku.dip.dataflow.exec.JobExecutionResultHandler.handleExecutionResult(JobExecutionResultHandler.java:29)
	at com.dataiku.dip.dataflow.exec.AbstractCodeBasedActivityRunner.execute(AbstractCodeBasedActivityRunner.java:70)
	at com.dataiku.dip.dataflow.exec.AbstractPythonRecipeRunner.executeModule(AbstractPythonRecipeRunner.java:99)
	at com.dataiku.dip.recipes.nlp.rag_embedding.RAGEmbeddingRecipeRunner$1.run(RAGEmbeddingRecipeRunner.java:124)
	at com.dataiku.dip.recipes.nlp.rag_embedding.RAGEmbeddingRecipeRunner.run(RAGEmbeddingRecipeRunner.java:104)
	at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:374)
[11:37:43] [INFO] [dku.flow.activity] running compute_Extracted_text_RBC_HSBC_embedded_1_NP - activity is finished
[11:37:43] [ERROR] [dku.flow.activity] running compute_Extracted_text_RBC_HSBC_embedded_1_NP - Activity failed
com.dataiku.dip.exceptions.ProcessDiedException: The Python process failed (exit code: 1). More info might be available in the logs.
	at com.dataiku.dip.dataflow.common.CodeBasedThingHelper.throwSubprocessError(CodeBasedThingHelper.java:23)
	at com.dataiku.dip.dataflow.exec.JobExecutionResultHandler.handleExecutionResult(JobExecutionResultHandler.java:29)
	at com.dataiku.dip.dataflow.exec.AbstractCodeBasedActivityRunner.execute(AbstractCodeBasedActivityRunner.java:70)
	at com.dataiku.dip.dataflow.exec.AbstractPythonRecipeRunner.executeModule(AbstractPythonRecipeRunner.java:99)
	at com.dataiku.dip.recipes.nlp.rag_embedding.RAGEmbeddingRecipeRunner$1.run(RAGEmbeddingRecipeRunner.java:124)
	at com.dataiku.dip.recipes.nlp.rag_embedding.RAGEmbeddingRecipeRunner.run(RAGEmbeddingRecipeRunner.java:104)
	at com.dataiku.dip.dataflow.jobrunner.ActivityRunner$FlowRunnableThread.run(ActivityRunner.java:374)
[11:37:43] [INFO] [dku.flow.activity] running compute_Extracted_text_RBC_HSBC_embedded_1_NP - Executing default post-activity lifecycle hook
[11:37:43] [INFO] [dku.flow.activity] running compute_Extracted_text_RBC_HSBC_embedded_1_NP - Done post-activity tasks

JordanB
Dataiker

Hi @Ajoshi005,

Unfortunately, the logs you captured here are not verbose enough to serve as a meaningful starting point for troubleshooting. However, as noted, more info might be available in the logs. I would recommend navigating to your project -> Jobs -> select the embedding job -> Actions -> View full job log. Working up from the bottom of the logs where the job fails, please scan the logs for any helpful hints, such as code env package version errors.
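The bottom-up scan described above can be sketched in Python, assuming the full job log has been copied into a string. The function name `find_error_hints` and the pattern list are illustrative assumptions, not an official list of DSS error markers. The sample lines are taken from the log excerpt in the original post.

```python
import re

# Illustrative patterns only (an assumption); extend them to match your own logs.
ERROR_PATTERNS = re.compile(r"Traceback|ImportError|ModuleNotFoundError|\[ERROR\]")


def find_error_hints(log_text, max_hints=5):
    """Return up to max_hints matching lines, scanning from the bottom of the log up."""
    hints = []
    for line in reversed(log_text.splitlines()):
        if ERROR_PATTERNS.search(line):
            hints.append(line.strip())
            if len(hints) >= max_hints:
                break
    return hints


# Three lines copied from the excerpt posted above.
sample_log = """\
[11:37:43] [INFO] [dip.exec.resultHandler] - Did not find a specific error from error files or logs, falling back on return code
[11:37:43] [ERROR] [dku.flow.activity] running compute_Extracted_text_RBC_HSBC_embedded_1_NP - Activity failed
[11:37:43] [INFO] [dku.flow.activity] running compute_Extracted_text_RBC_HSBC_embedded_1_NP - Done post-activity tasks
"""

for hint in find_error_hints(sample_log):
    print(hint)
```

On the short excerpt only the generic `[ERROR]` line matches, which is why the full job log is needed: a Python traceback or import error, if present, would appear further up.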

Please feel free to provide any additional logs that you think may be helpful.

Thanks!

Ajoshi005
Level 1
Author

Thank you for the quick response. I realised that I was running the Embed recipe without having set up a RAG Python code environment. However, I cannot find a Python 3.9 option when creating the environment. The only options available are Python 3.7 and Python 3.11 (experimental), and neither offers the option to add the RAG packages. I tried installing the RAG packages (langchain, pinecone, FAISS, etc.) in the 3.7 env, as shown in the Dataiku tutorial, and it still fails. This time the error in the log appears to be:

[2024/01/26-12:40:41.776] [null-err-42] [INFO] [dku.utils] - from langchain.vectorstores import FAISS, Pinecone, Chroma
[2024/01/26-12:40:41.776] [null-err-42] [INFO] [dku.utils] - ImportError: cannot import name 'Pinecone' from 'langchain.vectorstores' (/Users/akashjoshi/Library/DataScienceStudio/dss_home/code-envs/python/RAG/lib/python3.7/site-packages/langchain/vectorstores/__init__.py)
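An import failure like this can be checked outside the recipe, for example from a notebook that uses the same code environment. This is a minimal sketch: the helper `missing_names` is hypothetical, and it only reports which names the installed module does not expose, not why.

```python
import importlib


def missing_names(module_name, names):
    """Return the names that cannot be loaded from module_name.

    If the module itself cannot be imported, every name is reported as missing.
    """
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # Broad on purpose: a broken install can fail with more than ImportError.
        return list(names)
    missing = []
    for name in names:
        try:
            getattr(module, name)
        except (AttributeError, ImportError):
            missing.append(name)
    return missing


# The three names from the failing import line in the log above.
print(missing_names("langchain.vectorstores", ["FAISS", "Pinecone", "Chroma"]))
```

An empty list means all three names resolve in that environment. A non-empty list points to a version mismatch between the installed langchain and what the recipe expects.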

Turribeach

As a side note, this thread covers how to enable different Python interpreters, such as Python 3.9, in Dataiku:

https://community.dataiku.com/t5/Setup-Configuration/Best-Practices-for-Updating-Python/m-p/38870

This may be a better way of solving your problem.

Ajoshi005
Level 1
Author

I was able to resolve the issue. This might be helpful for anyone trying to implement RAG in the trial version of DSS (12.5). Here is what I did:

- Under Code Env, I installed the Python 3.11 (experimental) version.

- Selected the core packages version: Pandas 1.5 (Python 3.8 and above).

- Added the RAG packages through the Add sets of packages option:

(langchain==0.0.270
pydantic<2
chromadb
faiss-cpu
pinecone-client)
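To confirm that the environment ended up with these packages, something like the following could be run in a notebook attached to that code env. It is a sketch, not part of the original fix: `installed_versions` is a hypothetical helper, and it relies on the standard-library `importlib.metadata`, which needs Python 3.8 or later (so it fits the 3.11 env but not the 3.7 one).

```python
from importlib import metadata

# Distribution names taken from the package list above.
RAG_PACKAGES = ["langchain", "pydantic", "chromadb", "faiss-cpu", "pinecone-client"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(RAG_PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If langchain reports a version other than 0.0.270, or pydantic reports 2.x, the pins from the list were not applied.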
