Hello, I am new to ML and I'm trying to create a very basic recommendation system for a very simple dataset. My dataset only contains productID and customerID, and I have performed auto collaborative filtering (using the Recommendation System plugin) on it to generate a score. I want an item-based recommendation system where I input a customerID and it outputs which productIDs can be recommended. I am not sure about the schema required for the ML prediction model. The model target is the "score".
Can someone confirm what the labeled dataset (the dataset to train the model) and the unlabeled prepared dataset (the input dataset to Predict) should contain? The only 3 columns I have available are "customerID", "productID" and "score". If my unlabeled prepared dataset doesn't include productID, it complains about an incorrect schema when I try to run the Prediction.
For reference, I tried the Product Recommendation solution offered by Dataiku, but as my dataset is very simple, I thought I would create my own simple flow from scratch, which just includes slight data cleanup/preparation, collaborative filtering and the ML model.
Dataiku version 12 (running locally, free version)
Operating system used: Windows 10 Enterprise
If we want the model to perform as we expect on new data, we need to provide the model the exact same set of features (as the train dataset), with the same names, prepared in the same manner. Please see the following documentation on scoring requirements: https://knowledge.dataiku.com/latest/ml-analytics/model-scoring/concept-scoring-data.html#preparing-...
The input dataset to the scoring recipe should have the same columns ("customerID", "productID" and the "score"), but with an empty target column. You can do this by running the test dataset through the same flow as you had done with the train dataset, and then clear the target column.
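To illustrate the point about schemas, here is a minimal pandas sketch (the column values are made up; "score" is the target as in your original setup): the scoring input carries the exact same columns as the train dataset, with only the target column left empty.

```python
import pandas as pd

# Hypothetical labeled data, shaped like the train dataset.
labeled = pd.DataFrame({
    "customerID": [101, 101, 102],
    "productID": [1, 2, 1],
    "score": [0.8, 1.3, 0.5],
})

# The scoring input must have the same columns, with the same names,
# prepared in the same manner; only the target column is blanked.
to_score = labeled.copy()
to_score["score"] = None

print(list(to_score.columns))
```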
Also, it sounds like you would want productID to be your target variable, as your goal is to output productIDs.
Note, for ease of use, the recommended flow is to start with one input dataset, do your prep and transformations, and feed it to AutoML; by default, DSS randomly splits the input dataset into training and test sets: https://doc.dataiku.com/dss/latest/machine-learning/supervised/settings.html#settings-train-test-set
Please let us know if you have any questions.
Thanks for the reply. I followed the scoring documentation you shared, and yes, my target variable is productID. I have a few follow-up questions, though:
1. After the auto collaborative filtering, I am getting some scores that are >1. Shouldn't the score be between 0 and 1? What does >1 mean?
2. The output of the prediction comes out as float values, but the target variable, productID, takes individual integer values. How do I train and score the model so that it predicts an exact productID (as it seems wrong to round off the prediction to get the productID)?
For context, the flow contains a simple item-based auto collaborative filtering on the customerID and productID columns. The score output is split, and one half is used to train the model, where customerID and score are set to "no scaling", and "pairwise linear combinations" and "pairwise polynomial combinations" are enabled in the design. The trained model is then applied to the other half of the split to get the output, which contains the prediction column for productIDs.
Just to confirm you are talking about the "Auto collaborative filtering" recipe part of the recommendation system plugin, correct? https://www.dataiku.com/product/plugins/recommendation-system/
For this recipe, scores are not normalized; they are only meaningful relative to one another, as a way to rank items. For more information, please see the following doc: https://knowledge.dataiku.com/latest/kb/industry-solutions/product-reco/product-reco.html#feature-en...
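A quick sketch of what "not normalized" means in practice (the numbers are invented): a score above 1 is fine, because only the per-customer ordering of the scores matters when picking the top items.

```python
import pandas as pd

# Hypothetical unnormalized scores from collaborative filtering.
scores = pd.DataFrame({
    "customerID": [101, 101, 101, 102, 102],
    "productID":  [1, 2, 3, 1, 3],
    "score":      [1.7, 0.4, 2.3, 0.9, 0.2],
})

# Scores >1 are fine: rank items per customer and keep the top 2.
top2 = (scores.sort_values("score", ascending=False)
              .groupby("customerID")
              .head(2))
print(top2)
```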
Source code is also public - https://github.com/dataiku/dss-plugin-recommendation-system/blob/main/python-lib/query_handlers/auto...
Regarding the prediction variable type, are you feeding in the dataset to AutoML with the productID type as int? The prediction output type should match the datatype that you fed it.
Yes, I'm talking about the "Auto collaborative filtering" plugin. I did take a look at the solution you shared, but my data doesn't have the level of metadata/complexity required in its inputs, so I decided to make my own flow (as shown in the image). Still, my flow is based on the overall flow of the recommendation solution, where I perform my data preparation, provide it to the auto collaborative filtering plugin, and then use its output to train my model. (I tried to remove the unnecessary components from the recommendation solution, but that created more errors for me than it solved.)
Also, yes, my input type for productID is "int" integer. For the labeled dataset used to train my model, the columns are: productID as "int" integer, customerID as "int" integer and score as "double" decimal. The unlabeled dataset (to score the model) contains customerID as "int" integer and score as "double" decimal, and the target for the model is productID.
Also, I would really appreciate any feedback on the flow itself.
The flow looks good in terms of the pipeline. I would actually make productID a String type in this case, because some visual recipes will infer the schema. In your Prepare recipe, you can change the type directly by clicking on the column itself and selecting the type. The output dataset will then have String, and if you want to propagate this change across the flow, you can use the schema propagation tool.
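To show why the String type matters here, a toy sketch (invented data, plain scikit-learn rather than DSS internals): when productID is a string, the model treats it as a set of discrete classes and predicts an exact ID, instead of regressing a float that would have to be rounded.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy interactions; productID stored as a string so it is treated as
# a class label, not a number to regress on.
df = pd.DataFrame({
    "customerID": [101, 101, 102, 102, 103],
    "score":      [1.7, 0.4, 0.9, 2.1, 1.2],
    "productID":  ["1", "2", "1", "3", "2"],
})

model = DecisionTreeClassifier(random_state=0)
model.fit(df[["customerID", "score"]], df["productID"])

# The prediction is always one of the known productIDs, never a float.
pred = model.predict(df[["customerID", "score"]].iloc[[0]])[0]
print(pred)
```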
Can you give this a try?
Thanks for the reply! I followed the steps for the schema propagation tool, but unfortunately that didn't work either. I still made it work by forcing the columns to the "string" storage type with the "Text" meaning. I noticed that even though the storage type was showing "string" in the dataset, Dataiku was still inferring the meaning as "Integer", so I changed the meaning to "Text" and it now treats the values as discrete individual values.
I do have a follow-up question regarding the whole pipeline, though. I was able to create my model, deploy it and make predictions with it using the steps shown in this tutorial: https://knowledge.dataiku.com/latest/ml-analytics/model-scoring/concept-scoring-data.html.
But the final Score recipe (applying the model on data to predict) isn't the same as inference on genuinely new data, right? For more context, how would I recommend a product for a new customer? Do I need to generate the collaborative filtering score again for the whole dataset including the new customer and then send it to the model, or can I directly use the deployed model? (The latter method would only have the customerID, not the score feature, in the dataset.)
If I am understanding you correctly, you need to have all the same features: "If we want the model to perform as we expect on new data, we need to provide the model the exact same set of features, with the same names, prepared in the same manner." So yes, you would regenerate the collaborative filtering score including the new customer; the model will then use the scores provided to predict the productIDs.
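To make the new-customer path concrete, here is a toy numpy sketch of the first step, recomputing a score for a new customer from item-item similarity (this is a stand-in cosine-similarity score, not necessarily the plugin's exact formula): the resulting scores are the "score" feature the deployed model expects as input.

```python
import numpy as np

# Toy customer x product interaction matrix (rows: customers, cols: products).
interactions = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
], dtype=float)

# Item-item cosine similarity (a simple stand-in for the plugin's scoring).
norms = np.linalg.norm(interactions, axis=0)
sim = (interactions.T @ interactions) / np.outer(norms, norms)

# A new customer who interacted with product 0 only: their score for each
# product is that product's similarity to what they already have.
new_customer = np.array([1.0, 0.0, 0.0])
scores = sim @ new_customer

# These recomputed scores become the "score" feature fed to the
# deployed model; without them the scoring schema is incomplete.
print(scores)
```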
I hope this makes sense!