Dear Experts,
I am currently using Dataiku Online. I am trying to read a cleaned dataset from a managed folder, train a Word2Vec model on it, and store the model in another folder.
I am new to the Dataiku API and would appreciate some help here.
I then want to use the model in Dataiku to test it and create an API.
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import gensim
import nltk
from gensim.models import Word2Vec
from nltk import word_tokenize
# Read recipe inputs
CleansedData = dataiku.Folder("YEM8QpBl")
#CleansedData_info = CleansedData.get_info()
Source_Path = CleansedData.get_path()
path_Of_CSV = os.path.join(folder_path, "CleanDataForSentence2Vec.csv")
df = pd.read_csv(path_of_csv)#Problem here
# get array of titles
titles = df['title'].values.tolist()
# tokenize the each title
tok_titles = [word_tokenize(title) for title in titles]
# refer to here for all parameters:
# https://radimrehurek.com/gensim/models/word2vec.html
model = Word2Vec(tok_titles, sg=1, size=100, window=5, min_count=5, workers=4,
iter=100)
#model.save('./data/job_titles.model')
# Write recipe outputs #Problem here too
Model = dataiku.Folder("rlACbXYw");
path = Model.get_path();
model.save(path/'job_titles.model')
Model_info = Model.get_info()
The code is above.
Please suggest what I need to change.
Best regards,
Ash
Hi Ash,
Based on the context from another channel, the problem here is that your managed folders are not local.
This means you need to use the managed folder read/write APIs instead, e.g. get_download_stream and upload_stream. Please see the suggested changes in the code below.
https://knowledge.dataiku.com/latest/courses/folders/managed-folders.html
Let me know if that works for you!
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import gensim
import nltk
from gensim.models import Word2Vec
from nltk import word_tokenize
from io import BytesIO
# Read recipe inputs
CleansedData = dataiku.Folder("YEM8QpBl")
#CleansedData_info = CleansedData.get_info()
# get_path() only works for folders on the local filesystem, so it is not used here
#Source_Path = CleansedData.get_path()
# you can also use list_paths_in_partition() to list the files in the folder
# change the read to something like
with CleansedData.get_download_stream("/CleanDataForSentence2Vec.csv") as stream:
    df = pd.read_csv(stream)
# get array of titles
titles = df['title'].values.tolist()
# tokenize the each title
tok_titles = [word_tokenize(title) for title in titles]
# refer to here for all parameters:
# https://radimrehurek.com/gensim/models/word2vec.html
# note: gensim 4.0+ renames size to vector_size and iter to epochs
model = Word2Vec(tok_titles, sg=1, size=100, window=5, min_count=5, workers=4,
                 iter=100)
#model.save('./data/job_titles.model')
# Write recipe outputs
# change this part to something like
Model = dataiku.Folder("rlACbXYw")
# get_path() is not available for non-local folders, so save into an in-memory buffer and upload it
bytes_container = BytesIO()
model.save(bytes_container)
bytes_container.seek(0)
Model.upload_stream("saved_model.model", bytes_container)
Model_info = Model.get_info()
Hello @AlexT,
Thanks, it worked! I have a few more questions:
1) Can we reverse the process? I want to load the model back from the folder. How can I do that? Is it also done with streams? Could you share a sample snippet?
2) Can I publish the model to the model repository with code? A snippet would be appreciated.
Thanks and Regards,
Gabriel
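On question 1, here is a minimal sketch of the reverse direction, not a confirmed Dataiku recipe. It assumes the folder object exposes get_download_stream as a context manager (as in the answer above) and that the loader wants a file path, which is how gensim's Word2Vec.load is normally called. The helper name load_via_temp_file and the InMemoryFolder class are hypothetical: the class only stands in for dataiku.Folder so the sketch can run outside Dataiku.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from io import BytesIO


def load_via_temp_file(folder, remote_path, loader):
    """Copy one file from a managed folder to a temp file, then call loader(local_path).

    `folder` is anything with a get_download_stream(path) context manager;
    `loader` is any callable that takes a local file path.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, os.path.basename(remote_path))
        # Stream the remote bytes to disk without holding them all in memory.
        with folder.get_download_stream(remote_path) as stream, open(local_path, "wb") as f:
            shutil.copyfileobj(stream, f)
        # The object must be fully loaded before the temp directory is removed.
        return loader(local_path)


class InMemoryFolder:
    """Hypothetical stand-in for dataiku.Folder, for trying the helper locally."""

    def __init__(self):
        self._files = {}

    def upload_stream(self, path, f):
        self._files[path] = f.read()

    @contextmanager
    def get_download_stream(self, path):
        yield BytesIO(self._files[path])


# Inside Dataiku (not run here), the call would look something like:
#   model = load_via_temp_file(dataiku.Folder("rlACbXYw"), "saved_model.model", Word2Vec.load)
```

Going through a temporary file keeps the loading step identical to loading from a local path, so nothing depends on whether the loader accepts a file-like object.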