
Question regarding ordinary least squares

Level 2

Hello!

I was using the auto ML features for regression with my own data, and I had a hard time understanding the model coefficients from OLS. So I tried it again with the data from tutorial 2 to predict total revenue. The model coefficients came out to have similar issues. Please see below.

The coefficients for the top influence variables look to be extremely large. I also checked that campaign, user_agent_os, and gender shouldn't have any N/A values. However, when I applied the model to an unlabeled dataset, the predictions seemed to be very reasonable. Could you please give me some insights on this?

[Screenshot: Ming_1-1589231526774.png]

2 Replies
Dataiker Alumni

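OLS gives stable, interpretable coefficients only when there are no redundant features (no multicollinearity). In this example, there seems to be an issue with gender: if every category gets its own dummy column, the dummies sum to the intercept column, so the coefficients are not uniquely determined and can become arbitrarily large while the predictions stay reasonable. As there should not be any N/A values for gender, try setting the "missing values" strategy for gender to "drop rows" and "Drop dummy" to "Drop one dummy". "campaign" and "user_agent" probably have the same issue, and the same solution may work.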
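To illustrate the point about redundant dummies, here is a minimal NumPy sketch on synthetic data. It is not what Dataiku runs internally, and the variable names and the synthetic `gender` column are assumptions made up for this example. It shows that with an intercept plus one dummy per category, two very different coefficient vectors give identical predictions, and that dropping one dummy leaves a well-determined fit.

```python
import numpy as np

# Synthetic data (hypothetical): a binary "gender" feature and a target
# that is 10 for one group and 13 for the other, plus a little noise.
rng = np.random.default_rng(0)
n = 200
gender = rng.integers(0, 2, size=n)  # 0 and 1 stand in for two categories
y = 10.0 + 3.0 * gender + rng.normal(0.0, 0.1, size=n)

intercept = np.ones(n)
dummy_0 = (gender == 0).astype(float)
dummy_1 = (gender == 1).astype(float)

# One dummy per category plus an intercept: dummy_0 + dummy_1 == intercept,
# so the three columns are perfectly collinear.
X_full = np.column_stack([intercept, dummy_0, dummy_1])

# "Drop one dummy": keep the intercept and a single dummy.
X_drop = np.column_stack([intercept, dummy_1])

rank_full = np.linalg.matrix_rank(X_full)  # smaller than the 3 columns
rank_drop = np.linalg.matrix_rank(X_drop)  # equal to the 2 columns

# With the reduced design, least squares has a unique solution.
coef_drop, *_ = np.linalg.lstsq(X_drop, y, rcond=None)

# With the full design, many coefficient vectors fit equally well.
# Adding any multiple of [1, -1, -1] changes nothing in the predictions,
# because intercept - dummy_0 - dummy_1 is zero on every row.
coef_a = np.array([coef_drop[0], 0.0, coef_drop[1]])
coef_b = coef_a + 1e6 * np.array([1.0, -1.0, -1.0])
same_predictions = np.allclose(X_full @ coef_a, X_full @ coef_b)

print("rank of full design:", rank_full, "of", X_full.shape[1], "columns")
print("rank after dropping one dummy:", rank_drop, "of", X_drop.shape[1], "columns")
print("coefficients after dropping one dummy:", coef_drop)
print("huge and small coefficients predict the same:", same_predictions)
```

This is why extremely large coefficients and sensible predictions can coexist: the large values cancel each other out on every row.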

You can also use ridge or lasso regression, which handle multicollinearity through regularization.
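As a rough illustration of the regularization route, the sketch below hand-rolls the ridge closed-form solution in NumPy on synthetic data. It is not Dataiku's implementation; the penalty value `lam` and the synthetic `gender` column are assumptions chosen for the example. Both dummies are kept (no dummy is dropped), yet the coefficients stay small because the penalty makes the otherwise singular system solvable.

```python
import numpy as np

# Synthetic data (hypothetical), same shape as a two-category feature.
rng = np.random.default_rng(0)
n = 200
gender = rng.integers(0, 2, size=n)
y = 10.0 + 3.0 * gender + rng.normal(0.0, 0.1, size=n)

# Keep BOTH dummies, so the centered columns are perfectly collinear.
X = np.column_stack([(gender == 0), (gender == 1)]).astype(float)

# Center features and target so the intercept is left unpenalized.
x_mean = X.mean(axis=0)
y_mean = y.mean()
Xc = X - x_mean
yc = y - y_mean

lam = 1.0  # ridge penalty strength (assumed value for illustration)

# Ridge closed form: (Xc^T Xc + lam * I)^{-1} Xc^T yc.
# Xc^T Xc alone is singular here; adding lam * I makes it invertible.
gram = Xc.T @ Xc
coef = np.linalg.solve(gram + lam * np.eye(X.shape[1]), Xc.T @ yc)
ridge_intercept = y_mean - x_mean @ coef

# The difference between the two dummy coefficients estimates the group
# effect (about 3 in this synthetic setup, slightly shrunk by the penalty).
group_effect = coef[1] - coef[0]

print("ridge coefficients:", coef)
print("ridge intercept:", ridge_intercept)
print("estimated group effect:", group_effect)
```

The individual coefficients are shrunk and split symmetrically between the two dummies, so they should be read relative to each other rather than as absolute effects.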

Level 1

We are facing the same issue: large values appear for the intercept and the N/A category, even though there are no missing values in the categorical data entering the OLS model.

Is there an explanation for why this happens, or a way to avoid it?
