regression coefficient in Dataiku

UserBird · ‎09-02-2015

Hi,

I am used to analysing regression coefficients in R, and I am a little confused about how to do it in Dataiku. For instance, if I fit a regression on the Iris dataset to explain sepal length with the species and the petal length, I get:


Call:
lm(formula = iris$Sepal.Length ~ iris$Petal.Length + iris$Species)

Residuals:
     Min       1Q   Median       3Q      Max
-0.75310 -0.23142 -0.00081  0.23085  1.03100

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)             3.68353    0.10610  34.719  < 2e-16 ***
iris$Petal.Length       0.90456    0.06479  13.962  < 2e-16 ***
iris$Speciesversicolor -1.60097    0.19347  -8.275 7.37e-14 ***
iris$Speciesvirginica  -2.11767    0.27346  -7.744 1.48e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.338 on 146 degrees of freedom
Multiple R-squared:  0.8367,    Adjusted R-squared:  0.8334
F-statistic: 249.4 on 3 and 146 DF,  p-value: < 2.2e-16

The two regression coefficients iris$Speciesversicolor and iris$Speciesvirginica are to be compared with the species taken as reference (setosa). That is, iris$Speciesvirginica is the mean difference in sepal length between the species virginica and setosa, at equal petal length.

In Dataiku, I have three species coefficients and I don't know what the reference is. Besides, none of my coefficients are significant in Dataiku, whereas they are all significant in R:


species = Iris-virginica    ☆☆☆   9.01e-2    1.3485    0.3129
species = Iris-setosa       ☆☆☆   8.16e-2   -1.4032   -0.3081
petal_l                     ☆☆☆   4.02e-1    0.2486    0.2450
species = Iris-versicolor   ☆☆☆   4.57e-1   -0.1091   -0.0233
Intercept                                    5.8531

Could you explain why?

PGuti · ‎09-02-2015

Hi,

The difference in DSS is that one dummy variable is created per category, so there is no reference species (that's why you have one coefficient per species). I agree that it is closer to the ML view than to the statistical view.
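
To compare with R's output, one-dummy-per-category coefficients can be re-expressed against a chosen reference level: add the reference level's coefficient to the intercept and subtract it from the other levels. Below is a minimal sketch of that arithmetic. The helper name `to_reference_coding` is hypothetical (it is not a DSS function), the numbers are the coefficient column of the DSS output quoted above taken at face value, and it assumes those coefficients apply to unscaled 0/1 dummies, which may not hold if rescaling is enabled. The resulting values need not match R's, since the two fits can differ in other ways.

```python
def to_reference_coding(intercept, dummy_coefs, reference):
    """Re-express one-dummy-per-category coefficients relative to a reference level.

    The prediction for a level k is intercept + dummy_coefs[k]. Rewriting it as
    (intercept + dummy_coefs[reference]) + (dummy_coefs[k] - dummy_coefs[reference])
    leaves every prediction unchanged.
    """
    base = dummy_coefs[reference]
    new_intercept = intercept + base
    contrasts = {
        level: coef - base
        for level, coef in dummy_coefs.items()
        if level != reference
    }
    return new_intercept, contrasts


# Values read from the DSS output in the question (assumed to be raw-scale coefficients).
dss_intercept = 5.8531
dss_species = {
    "Iris-setosa": -1.4032,
    "Iris-versicolor": -0.1091,
    "Iris-virginica": 1.3485,
}

ref_intercept, contrasts = to_reference_coding(dss_intercept, dss_species, "Iris-setosa")

print("intercept (setosa as reference):", round(ref_intercept, 4))
for level, value in contrasts.items():
    print(level, "vs Iris-setosa:", round(value, 4))
```

The contrasts printed this way play the same role as iris$Speciesversicolor and iris$Speciesvirginica in the R summary: each is the difference from setosa at equal petal length.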

The second difference is that DSS always starts by creating a train set and a test set. So you may have trained the regression on 80% of the data in DSS and on 100% of it in R.
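
As a rough illustration of the train/test point, the sketch below fits the same one-predictor least-squares line on all rows and on a random 80% subset of synthetic data. The data, the seed and the helper `ols_slope_intercept` are made up for this example and are not what DSS does internally; the point is only that the two fits give close but not identical coefficients.

```python
import random


def ols_slope_intercept(xs, ys):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic data: 150 rows, true slope 0.4 and intercept 4.0, plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(1.0, 7.0) for _ in range(150)]
ys = [4.0 + 0.4 * x + rng.gauss(0.0, 0.3) for x in xs]

# Random 80% train subset; the remaining 20% would be held out as the test set.
indices = list(range(len(xs)))
rng.shuffle(indices)
train_idx = indices[: int(0.8 * len(indices))]

full_fit = ols_slope_intercept(xs, ys)
train_fit = ols_slope_intercept(
    [xs[i] for i in train_idx], [ys[i] for i in train_idx]
)

print("fit on 100% of rows:", full_fit)
print("fit on  80% of rows:", train_fit)
```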

Does this answer your question? To explain it further, I would need the exact parameters you chose for your regression (exact model, rescaling options, etc.).
