
# regression coefficient in Dataiku

Dataiker, Alpha Tester Posts: 535 Dataiker
edited July 16

Hi,

I am used to analysing regression coefficients in R, and I am a little confused about how to do the same in Dataiku. For instance, if I fit a regression on the Iris dataset to explain sepal length from the species and the petal length, I get:

    Call:
    lm(formula = iris$Sepal.Length ~ iris$Petal.Length + iris$Species)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.75310 -0.23142 -0.00081  0.23085  1.03100

    Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
    (Intercept)             3.68353    0.10610  34.719  < 2e-16 ***
    iris$Petal.Length       0.90456    0.06479  13.962  < 2e-16 ***
    iris$Speciesversicolor -1.60097    0.19347  -8.275 7.37e-14 ***
    iris$Speciesvirginica  -2.11767    0.27346  -7.744 1.48e-12 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.338 on 146 degrees of freedom
    Multiple R-squared:  0.8367,    Adjusted R-squared:  0.8334
    F-statistic: 249.4 on 3 and 146 DF,  p-value: < 2.2e-16

The two regression coefficients `iris$Speciesversicolor` and `iris$Speciesvirginica` are to be compared with the species taken as reference (setosa). That is, `iris$Speciesvirginica` is the difference in mean sepal length between virginica and setosa, at a given petal length.

In Dataiku, I get three species coefficients and I don't know what the reference is. Besides, none of my coefficients are significant in Dataiku, whereas they are all significant in R:

    species = Iris-virginica     9.01e-2    1.3485    0.3129
    species = Iris-setosa        8.16e-2   -1.4032   -0.3081
    petal_l                      4.02e-1    0.2486    0.2450
    species = Iris-versicolor    4.57e-1   -0.1091   -0.0233
    Intercept   5.8531

Could you explain why?


• Registered Posts: 5 ✭✭✭✭✭
Hi,

The difference in DSS is that one dummy variable is created per category, so there is no reference species (that's why you have one coefficient per species). I agree that this is closer to the ML view than to the statistical view.
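
To make the two codings concrete, here is a minimal sketch in plain Python. It is not DSS's or R's actual code: the data are made-up, iris-like values, and `solve` and `ols` are hypothetical helpers written only for this illustration. It fits the same least-squares model twice, once with setosa as the reference level (as R does) and once with one dummy per species. Note that the DSS output above also has an intercept; the sketch drops it in the second model only so that plain least squares has a unique solution.

```python
# Toy data (hypothetical, iris-like): (species, petal_length, sepal_length)
rows = [
    ("setosa", 1.4, 5.1), ("setosa", 1.3, 4.7), ("setosa", 1.7, 5.4),
    ("versicolor", 4.7, 7.0), ("versicolor", 4.0, 5.5), ("versicolor", 4.5, 6.4),
    ("virginica", 6.0, 6.3), ("virginica", 5.1, 5.8), ("virginica", 5.9, 7.1),
]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(design, target):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(design, target)) for i in range(k)]
    return solve(xtx, xty)


species = ["setosa", "versicolor", "virginica"]
y = [sepal for _, _, sepal in rows]

# Reference coding: intercept, petal length, dummies for every species except setosa.
X_ref = [[1.0, petal, float(s == "versicolor"), float(s == "virginica")]
         for s, petal, _ in rows]
# One dummy per species, no separate intercept.
X_all = [[petal] + [float(s == k) for k in species] for s, petal, _ in rows]

b0, b_petal, b_versicolor, b_virginica = ols(X_ref, y)
g_petal, g_setosa, g_versicolor, g_virginica = ols(X_all, y)

# The reference-coded coefficients are differences of the per-species ones.
print("virginica vs setosa (reference coding):", b_virginica)
print("virginica minus setosa (one dummy each):", g_virginica - g_setosa)
```

In this unregularized toy setting both codings describe the same fit, so a contrast against a reference species can be recovered by subtracting two per-species coefficients. Whether that carries over exactly to a given DSS model depends on its regularization and rescaling settings.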

The second difference is that in DSS we always start by creating a train set and a test set. So you may have trained the regression on 80% of the data in DSS, versus 100% in R.
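
As a rough illustration of that second point, the sketch below fits the same simple regression on all rows and on a random 80% subset. Everything in it is an assumption made for the example: the synthetic data, the seed, and the `fit_line` helper are hypothetical, and the shuffle-and-cut split is just one simple way to hold out 20% of the rows, not a description of how DSS splits data.

```python
import random

rng = random.Random(0)  # fixed seed so the split is reproducible

# Synthetic data (hypothetical): y = 2 + 0.9 * x + Gaussian noise
data = [(i / 10, 2.0 + 0.9 * (i / 10) + rng.gauss(0.0, 0.3)) for i in range(100)]


def fit_line(points):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Shuffle a copy, then keep the first 80% for training and the rest for testing.
shuffled = data[:]
rng.shuffle(shuffled)
cut = len(shuffled) * 8 // 10
train, test = shuffled[:cut], shuffled[cut:]

full_fit = fit_line(data)    # what a fit on 100% of the rows gives
train_fit = fit_line(train)  # what a fit on the 80% train set gives

print("fit on all rows   (intercept, slope):", full_fit)
print("fit on train rows (intercept, slope):", train_fit)
```

The two fits should be close but not identical, so a train/test split alone would explain small differences in the coefficients, not a change of sign or scale.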

Does this answer your question? To explain it further, I would need the exact parameters you chose for your regression (exact model, rescaling options, etc.).
