Regression Analysis Guide
Understand relationships between variables using regression analysis.
Usage
- Define your question: what outcome are you trying to predict or explain?
- Choose regression type: linear (continuous outcome) or logistic (binary outcome); either becomes a multiple regression when it has more than one predictor
- Check assumptions: linearity, normality of residuals, independence of observations, homoscedasticity (constant residual variance)
- Interpret coefficients, R-squared, and p-values correctly
- Validate the model and check for common problems
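The steps above can be sketched for the simplest case, one predictor and a continuous outcome. This is a minimal illustration using only the Python standard library, not a replacement for a statistics package. The function name fit_simple_ols and the price/sales numbers are hypothetical, invented for this example.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residuals are observed minus fitted values; plot them to check assumptions.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    ss_res = sum(e ** 2 for e in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot

    # Adjusted R² penalizes extra predictors; k = 1 predictor in this sketch.
    k = 1
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

    return {
        "intercept": intercept,
        "slope": slope,
        "r2": r2,
        "adj_r2": adj_r2,
        "residuals": residuals,
    }


# Hypothetical price/sales observations, invented for illustration only.
prices = [4, 5, 6, 7, 8, 9, 10, 11]
sales = [810, 745, 705, 640, 610, 540, 505, 445]

fit = fit_simple_ols(prices, sales)
print(f"Sales = {fit['intercept']:.1f} + ({fit['slope']:.1f}) x Price")
print(f"R2 = {fit['r2']:.3f}, adjusted R2 = {fit['adj_r2']:.3f}")
print("Residuals:", [round(e, 1) for e in fit["residuals"]])
```

The residual list is the input for the assumption checks: a curve or funnel shape when plotted against the fitted values suggests nonlinearity or non-constant variance.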
Examples
- Linear regression (pricing impact): Question: How does price affect sales? Model: Sales = 1,000 - 50 × Price. Interpretation: for every $1 price increase, predicted sales decrease by 50 units. R² = 0.72 means price explains 72% of the variation in sales. P-value < 0.001 means a slope this large would be very unlikely if price and sales were truly unrelated, so the relationship is statistically significant. It is not the probability that the result is due to chance
- Multiple regression (salary prediction): Salary = $35,000 + $2,500 × years_experience + $8,000 × has_masters + $12,000 × is_engineering. Each coefficient shows the association of that variable with salary while holding the others constant. A master's degree is associated with an $8K higher salary at any level of experience, because this model has no interaction terms
- Logistic regression (churn prediction): Probability of churn = f(days_since_login, support_tickets, contract_type). Output: each variable's odds ratio (OR), holding the other variables constant. Days_since_login OR=1.05 means each additional day without login multiplies churn odds by 1.05, a 5% increase. Support_tickets OR=1.3 means each ticket increases the odds by 30%. These are changes in odds, not in probability
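The arithmetic behind these three examples can be worked through in a few lines. The coefficients and odds ratios below come from the examples above; the function names and the 10% baseline churn probability are assumptions made only for illustration.

```python
def predict_sales(price):
    """Linear example: Sales = 1,000 - 50 x Price."""
    return 1000 - 50 * price


def predict_salary(years_experience, has_masters, is_engineering):
    """Multiple regression example; the two flags are 0/1 indicators."""
    return (35_000
            + 2_500 * years_experience
            + 8_000 * has_masters
            + 12_000 * is_engineering)


def odds_multiplier(odds_ratio, units):
    """Odds ratios compound multiplicatively across several units."""
    return odds_ratio ** units


def shift_probability(baseline_prob, odds_ratio, units=1):
    """Convert a probability to odds, apply the odds ratio, convert back."""
    odds = baseline_prob / (1 - baseline_prob)
    new_odds = odds * odds_multiplier(odds_ratio, units)
    return new_odds / (1 + new_odds)


# Predicted sales at a price of $8.
print(predict_sales(8))

# The master's coefficient is the gap between two otherwise identical people.
with_masters = predict_salary(5, has_masters=1, is_engineering=0)
without_masters = predict_salary(5, has_masters=0, is_engineering=0)
print(with_masters - without_masters)

# Ten extra days without login multiplies churn odds by 1.05 ** 10.
print(round(odds_multiplier(1.05, 10), 3))

# One extra support ticket, starting from a hypothetical 10% churn probability.
print(round(shift_probability(0.10, 1.3), 4))
```

The last line shows why odds and probability should not be confused: a 30% increase in odds moves a 10% churn probability to roughly 12.6%, not to 40% or to 13%.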
Guidelines
- Correlation does not imply causation — regression shows associations. Causal claims require experimental design or careful quasi-experimental methods
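- Check for multicollinearity: if two predictors are highly correlated (|r| > 0.7 is a common warning sign), the model cannot reliably separate their effects and the coefficient standard errors inflate. Check the VIF (variance inflation factor); as a rule of thumb, consider removing or combining variables with VIF > 5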
- R² never decreases when you add variables, even irrelevant ones. Use adjusted R² or AIC/BIC to compare models with different numbers of predictors
- Look at residual plots, not just R² — a high R² with patterned residuals indicates a misspecified model (wrong functional form)
- Outliers can dramatically affect regression results. Always check Cook's distance to identify influential points
- Start with simple models and add complexity. A 3-variable model you understand beats a 20-variable model you don't
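Two of the checks above, VIF and Cook's distance, can be sketched with the standard library for small cases. This is a simplified illustration under stated assumptions: the VIF shortcut 1 / (1 - r²) holds only when the model has exactly two predictors, and the Cook's distance formula is written for a one-predictor fit. All function names and data are hypothetical, and the 4/n cutoff is one commonly cited rule of thumb rather than a fixed standard.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / sqrt(saa * sbb)


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)


def cooks_distances(x, y):
    """Cook's distance for each observation in a one-predictor OLS fit."""
    n = len(x)
    p = 2  # estimated parameters: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)

    distances = []
    for xi, e in zip(x, residuals):
        # Leverage: how far this x value sits from the mean of x.
        leverage = 1 / n + (xi - mean_x) ** 2 / sxx
        distances.append((e ** 2 / (p * mse)) * leverage / (1 - leverage) ** 2)
    return distances


# Hypothetical predictors that tend to move together.
years_experience = [1, 3, 4, 6, 8, 10, 12, 15]
age = [24, 25, 29, 30, 36, 35, 41, 44]
print("r   =", round(pearson_r(years_experience, age), 3))
print("VIF =", round(vif_two_predictors(years_experience, age), 2))

# Hypothetical data where the last point is far from the rest.
x_vals = [1, 2, 3, 4, 5, 6, 7, 20]
y_vals = [2, 4, 6, 8, 10, 12, 14, 10]
distances = cooks_distances(x_vals, y_vals)
cutoff = 4 / len(x_vals)  # rule-of-thumb threshold, an assumption
flagged = [i for i, d in enumerate(distances) if d > cutoff]
print("Influential point indices:", flagged)
```

In the two-predictor case, VIF > 5 corresponds to |r| above about 0.89, so the correlation and VIF rules of thumb do not flag exactly the same situations. With more than two predictors, each VIF comes from regressing one predictor on all the others, which this sketch does not cover.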