For my project I created a Python 3.6 environment and installed the required packages, scikit-learn among them. Despite that, I'm facing a problem that seems strange to me. In my recipes I can import the library and its submodules, but when I create an API endpoint in API Designer and try to run test queries, I get an error, the one in the title of this discussion. In my API settings I have selected my environment, so I don't understand why it doesn't work.
Here is the history of this environment in a bit more detail. At first I created the environment with the default scikit-learn version, 0.23, the newest. With that version I got the error above in the API Designer. It appeared when trying to execute:
from sklearn.ensemble.partial_dependence import partial_dependence
Searching online, it looked like a version problem, so I created another environment with scikit-learn 0.22 and the error disappeared. I was able to run my test queries and then push the API to the API Deployer, with my service deployed and working perfectly.
Recently I tried to re-run those test queries in the API Designer and the same error appeared, this time when trying to execute:
from sklearn.model_selection import KFold
I then tried to import it in my recipe, with the same env, and it works without any problem. I don't understand where the problem is; any help would be really appreciated.
P.S.: I've also tried creating an empty API endpoint with a new environment that, besides the required default packages, has scikit-learn installed, and the same error appears, so I suppose it doesn't depend on my work.
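One way to narrow this down is to print, from both the recipe and the endpoint code, which interpreter is running and which scikit-learn it picks up. This is only a minimal sketch: describe_module is a hypothetical helper name, not part of DSS or scikit-learn, and it uses nothing beyond the standard library.

```python
import importlib
import sys


def describe_module(name):
    """Return (version, file path) of an importable module, or (None, None)."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        # Covers ModuleNotFoundError too, which is a subclass of ImportError.
        return None, None
    return getattr(module, "__version__", None), getattr(module, "__file__", None)


# Run this in the recipe and in the endpoint, then compare the two outputs.
print("interpreter:", sys.executable)
print("sklearn:", describe_module("sklearn"))
```

If the two outputs differ, the recipe and the endpoint are not running in the same environment, whatever the settings say.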
Can you do an "Update" of the code env with "rebuild env" checked, then verify in the "Installed packages" tab that the installed scikit-learn is the right version? (i.e. scikit-learn>=0.20,<0.21, as you see when you use "add sets of packages" in the code env)
I've done what you asked and now scikit-learn is no longer in my installed packages, along with other packages that I had installed. Do you have any idea why that happened?
Did you install them manually from a Python notebook or via the command line? If they are listed in the "packages to install" tab, only a failure to build the code env should prevent them from actually being installed.
Sorry, I hadn't understood what the rebuild does. I hadn't listed the packages, so of course rebuilding resulted in a default environment without the additional packages. I added the recommended set of packages, which includes scikit-learn 0.20.4, plus joblib, one of the packages my code requires.
They are correctly installed, I can see them in the installed packages. When I run the queries it now says that the Dev server is running, which is good because before it wasn't, but it still gives an error:
ModuleNotFoundError: No module named 'sklearn.ensemble._forest'
At this point I have another question: to use sklearn on Dataiku, is a version >=0.20,<0.21 required?
sklearn.ensemble._forest was indeed added in sklearn v0.22 (it was called sklearn.ensemble.forest before). This means that:
- either your code explicitly calls or imports it
- or the model you are trying to use was built in a code environment with sklearn >= 0.22 and you're now trying to read it in a code env with sklearn < 0.22, which is not possible because of how pickle works: the pickle records the module path of the model's class, and that module doesn't exist in the older version. You'll need to retrain the model in a code env with sklearn 0.20.4
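The pickle behaviour behind the second case can be reproduced without scikit-learn. The sketch below is only an illustration under stated assumptions: fakelib_forest and Forest are made-up names standing in for a library module and a model class, and removing the module from sys.modules simulates loading the file in an environment where that module does not exist.

```python
import pickle
import sys
import types

# Throwaway module standing in for a library's internal module
# (hypothetical name, analogous to sklearn.ensemble._forest).
fake = types.ModuleType("fakelib_forest")


class Forest:
    """Stand-in for a model class."""

    def __init__(self, n_trees):
        self.n_trees = n_trees


# Make the class look as if it had been defined inside the fake module.
Forest.__module__ = "fakelib_forest"
Forest.__qualname__ = "Forest"
fake.Forest = Forest
sys.modules["fakelib_forest"] = fake

# The pickle stores the module path and class name, not the class code.
payload = pickle.dumps(Forest(10))

# Simulate an environment where that module path does not exist.
del sys.modules["fakelib_forest"]
try:
    pickle.loads(payload)
    load_error = None
except ModuleNotFoundError as exc:
    load_error = str(exc)

print(load_error)
```

The message has the same shape as the one from the API Designer, just with the made-up module name in place of sklearn.ensemble._forest.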
You were right, now it works, thank you very much! So, as a conclusion, I assume that working with Dataiku requires the right version of scikit-learn?
(sorry, forgot about the second question along the way)
Yes, the DSS code around ML assumes sklearn in a given version range. Given the size of the ML codebase, it is hard to say what exactly requires these specific versions and not a later one. You can always try newer versions of sklearn and check whether it fails for the particular models you are building, but that is of course not supported by Dataiku.
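If you want code to fail early with a clear message when the environment drifts outside the expected range, a small check along these lines could help. It is a rough sketch, not something DSS provides: parse_version and in_range are hypothetical helpers, the parsing handles only simple dotted versions, and the bounds are the >=0.20,<0.21 range mentioned earlier in this thread.

```python
def parse_version(text):
    """Turn '0.20.4' into (0, 20, 4), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def in_range(version, low, high):
    """True if low <= version < high, comparing numeric components."""
    return parse_version(low) <= parse_version(version) < parse_version(high)


# Versions that came up in this thread, checked against >=0.20,<0.21.
for candidate in ("0.20.4", "0.22.1", "0.23.0"):
    print(candidate, in_range(candidate, "0.20", "0.21"))
```

In real code you would pass sklearn.__version__ as the first argument and raise an error when the check returns False.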