How can I add p-value estimation to my logistic regressions?

SimonDeschamps Registered Posts: 2 ✭✭✭✭
I am currently leading a statistical analysis of absenteeism data. In this study, I am looking at the influence of multiple factors on employees' presence at work. But whenever I use logistic regression, I can't get p-values for the factors' coefficients (except when I use a PCA to reduce the dimensionality, but in that case I can't interpret the results, which does not help my case either).

Does anyone know how to recover that on Dataiku?



Best Answer

  • Clément_Stenac
    Clément_Stenac Dataiker, Dataiku DSS Core Designer, Registered Posts: 753 Dataiker
    Answer ✓

    DSS only shows p-values when there are fewer than 1000 coefficients (after preprocessing, so each categorical value becomes a coefficient). Even if you have fewer than 1000 coefficients, computing p-values is not always possible due to numerical issues.

    Beware that logistic regression in DSS is always regularized, and p-values are not strictly defined for regularized regressions.
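
    To get a rough idea of how many coefficients a dataset produces once categorical values are expanded, here is a minimal sketch. It uses pandas `get_dummies` as a stand-in for the preprocessing step; the preprocessing DSS actually applies may differ, and the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two categorical columns and one numeric column.
df = pd.DataFrame({
    "department": ["sales", "it", "hr", "it"],
    "contract": ["full_time", "part_time", "full_time", "full_time"],
    "tenure_years": [1.5, 4.0, 7.2, 3.1],
})

# Each distinct categorical value becomes its own column;
# numeric columns are passed through unchanged.
encoded = pd.get_dummies(df)

# 3 department values + 2 contract values + 1 numeric column
n_coefficients = encoded.shape[1]
print(n_coefficients)  # 6
```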


  • SimonDeschamps
    SimonDeschamps Registered Posts: 2 ✭✭✭✭
    Thank you for that (really) quick answer. However, I only have 14 columns, with 52 categorical values in total, so I am guessing that I'm facing those "numerical issues".

    Could you explain what they are and how to get around them?

    Many thanks
  • Alex_Combessie
    Alex_Combessie Alpha Tester, Dataiker Alumni Posts: 539 ✭✭✭✭✭✭✭✭✭
    If you want to use p-values for rigorous statistical tests, I would advise using a logistic regression library which does not apply regularization. The scikit-learn version we use in the visual machine learning feature is regularized, which is better for classification performance, but less so for interpretability.
    There is a Python implementation of unregularized logistic regression (a.k.a. logit) in the statsmodels library. Alternatively, in R you could use the built-in glm function.