Regression coefficients in Dataiku
Hi,
I am used to analysing regression coefficients in R, and I am a little confused about how to do it in Dataiku. For instance, if I fit a regression on the iris dataset to explain sepal length with species and petal length, I get:
Call:
lm(formula = iris$Sepal.Length ~ iris$Petal.Length + iris$Species)

Residuals:
     Min       1Q   Median       3Q      Max
-0.75310 -0.23142 -0.00081  0.23085  1.03100

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)             3.68353    0.10610  34.719  < 2e-16 ***
iris$Petal.Length       0.90456    0.06479  13.962  < 2e-16 ***
iris$Speciesversicolor -1.60097    0.19347  -8.275 7.37e-14 ***
iris$Speciesvirginica  -2.11767    0.27346  -7.744 1.48e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.338 on 146 degrees of freedom
Multiple R-squared:  0.8367,  Adjusted R-squared:  0.8334
F-statistic: 249.4 on 3 and 146 DF,  p-value: < 2.2e-16
The two coefficients iris$Speciesversicolor and iris$Speciesvirginica are to be compared with the species taken as the reference (setosa). That is, iris$Speciesvirginica is the difference in mean sepal length between virginica and setosa, at a given petal length.
In Dataiku, I have three species coefficients and I don't know what the reference is. Besides, none of my coefficients are significant in Dataiku, whereas they are all significant in R:
species = Iris-virginica    9.01e-2  1.3485  0.3129
species = Iris-setosa       8.16e-2  1.4032  0.3081
petal_l                     4.02e-1  0.2486  0.2450
species = Iris-versicolor   4.57e-1  0.1091  0.0233
Intercept 5.8531
Could you explain why?
Answers

Hi,
The first difference is that DSS creates one dummy variable per category, so there is no reference species (that's why you get one coefficient per species). I agree that this is closer to the ML view than to the statistical view.
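To make the two encodings concrete, here is a minimal sketch in Python with numpy. It uses a small synthetic dataset (made-up values, not the real iris data) and a minimum-norm least-squares fit as a stand-in; the actual solver and regularization used by DSS are not stated in this thread, so treat that part as an assumption. The point it illustrates is that with one dummy per category, the individual species coefficients depend on how the solver resolves the redundancy, but their differences still match the R-style reference-coded coefficients.

```python
import numpy as np

# Synthetic stand-in data (hypothetical values, NOT the real iris measurements).
rng = np.random.default_rng(0)
levels = ["setosa", "versicolor", "virginica"]
species = np.repeat(np.array(levels), 20)
petal_l = rng.normal(loc=np.repeat([1.5, 4.3, 5.6], 20), scale=0.4)
offsets = {"setosa": 0.0, "versicolor": -1.6, "virginica": -2.1}
species_effect = np.array([offsets[s] for s in species])
sepal_l = 3.7 + 0.9 * petal_l + species_effect + rng.normal(scale=0.3, size=species.size)

# One 0/1 column per species, in the order given by `levels`.
dummies = np.column_stack([(species == lv).astype(float) for lv in levels])
ones = np.ones((species.size, 1))

# R-style treatment coding: drop the first level (setosa) as the reference.
# Columns: intercept, petal_l, versicolor, virginica
X_ref = np.column_stack([ones, petal_l, dummies[:, 1:]])
beta_ref, *_ = np.linalg.lstsq(X_ref, sepal_l, rcond=None)

# One dummy per category: the three dummies sum to the intercept column,
# so the design matrix is rank-deficient and lstsq returns the
# minimum-norm solution among all equally good fits.
# Columns: intercept, petal_l, setosa, versicolor, virginica
X_full = np.column_stack([ones, petal_l, dummies])
beta_full, *_ = np.linalg.lstsq(X_full, sepal_l, rcond=None)

print("reference coding:", beta_ref)
print("full dummies:    ", beta_full)
print("versicolor - setosa:", beta_full[3] - beta_full[2])
print("virginica  - setosa:", beta_full[4] - beta_full[2])
```

So to recover an R-style reading from one-coefficient-per-species output, subtract the setosa coefficient from the other two. If the model is regularized, this equality only holds approximately.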
The second difference is that DSS always starts by splitting the data into a train set and a test set. So you may have trained the regression on 80% of the data in DSS and on 100% in R.
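As a rough illustration of the train/test point, the sketch below fits the same simple regression once on all rows and once on a random 80% subset. The data are synthetic and the 80/20 ratio and seed are assumptions for the example, not a description of your DSS settings. The two fits give close but not identical coefficients.

```python
import numpy as np

# Synthetic data (hypothetical values, not the iris data).
rng = np.random.default_rng(42)
n = 150
x = rng.uniform(1.0, 7.0, size=n)
y = 4.0 + 0.4 * x + rng.normal(scale=0.4, size=n)


def fit_ols(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns [b0, b1]."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Random 80/20 split: the first 80% of a shuffled index is the train set.
idx = rng.permutation(n)
n_train = n * 8 // 10
train = idx[:n_train]

beta_all = fit_ols(x, y)                  # fitted on 100% of the rows
beta_train = fit_ols(x[train], y[train])  # fitted on the 80% train set only

print("all rows  :", beta_all)
print("train only:", beta_train)
```

To compare DSS with R on equal terms, you would need to fit R on the same train rows, for example by exporting the train set from DSS.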
Does this answer your question? To explain further, I would need the exact parameters you chose for your regression (exact model, rescaling options, etc.).