MicromOne: Common Data Science Interview Questions — Regression

1. In a linear regression model with ONE independent variable and an intercept, how many coefficients are calculated?

There are:

  • 1 coefficient for the independent variable

  • 1 coefficient for the intercept

So the model has 2 coefficients.

The formula is:

genui{"math_block_widget_always_prefetch_v2":{"content":"y = \beta_0 + \beta_1 x"}}

Where:

  • \( \beta_0 \) = intercept → the predicted value of \(y\) when \(x = 0\)

  • \( \beta_1 \) = slope/coefficient → how much \(y\) changes when \(x\) increases by one unit

Simple Example

Suppose:

  • \(x\) = age

  • \(y\) = height

Then:

  • the intercept represents the predicted height at age 0

  • the slope tells you how much height increases for each additional year of age

Visually:

  • X-axis → age

  • Y-axis → height

  • the slope of the line = coefficient
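
The two coefficients can be computed directly with ordinary least squares. The sketch below is a minimal illustration: the age and height values are made up (not real measurements), and the helper name `fit_simple_linear_regression` is hypothetical.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # Intercept: makes the line pass through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Illustrative data only: ages in years, heights in cm.
ages = [2, 4, 6, 8, 10]
heights = [86, 100, 114, 128, 142]

coefficients = fit_simple_linear_regression(ages, heights)
intercept, slope = coefficients
print(len(coefficients))  # one intercept plus one slope
print(intercept, slope)
```

With one independent variable, the fit returns exactly two numbers: the intercept and the slope.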

2. Does Logistic Regression predict a categorical or numerical target?

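Logistic Regression predicts a categorical target. The model outputs the probability of a class, which is then converted into a label, commonly with a 0.5 threshold.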

Usually:

  • 0 / 1

  • yes / no

  • success / failure

Examples:

  • Will the customer buy? → yes/no

  • Is the email spam? → yes/no

  • Will the user churn? → yes/no

Important Difference

Linear Regression

Used for continuous numerical outcomes:

  • house prices

  • salary

  • temperature

Logistic Regression

Used for categories:

  • approved/not approved

  • sick/healthy

  • clicked/did not click
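
Logistic regression gets from a numeric score to a category in two steps: a sigmoid turns the linear score into a probability, and a threshold turns the probability into a label. The sketch below assumes hypothetical coefficients `b0` and `b1` (not fitted to any real data) and the common 0.5 threshold; the function names are illustrative.

```python
import math


def predict_probability(x, b0, b1):
    """Sigmoid of the linear score b0 + b1 * x, a value between 0 and 1."""
    score = b0 + b1 * x
    return 1.0 / (1.0 + math.exp(-score))


def predict_label(x, b0, b1, threshold=0.5):
    """Turn the probability into a category: 1 ("yes") or 0 ("no")."""
    return 1 if predict_probability(x, b0, b1) >= threshold else 0


# Hypothetical coefficients chosen for illustration only.
b0, b1 = -3.0, 1.0

for x in [0.0, 3.0, 6.0]:
    print(x, round(predict_probability(x, b0, b1), 3), predict_label(x, b0, b1))
```

The probability is numerical, but the final prediction is a category, which is why logistic regression is used for classification.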

3. Which part of a linear regression result tells you whether an independent variable is statistically significant?

The answer is:

The p-value

Each coefficient has an associated p-value.

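The p-value is the probability of seeing a coefficient at least as far from zero as the estimated one if the variable truly had no relationship with the dependent variable. A small p-value is evidence that the relationship is statistically significant.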

Common Rule

If:

p < 0.05

then the variable is considered statistically significant.

Practical Meaning

  • low p-value → a coefficient this large would be unlikely if the true effect were zero

  • high p-value → the data are consistent with no effect, which does not prove there is none
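
To make the idea concrete, a p-value for a slope can be approximated by shuffling. Regression software usually reports a p-value from a t-test; the sketch below swaps in a permutation test because it needs only the standard library and shows the logic directly. The data, the helper names, and the number of permutations are all assumptions for illustration.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Share of shuffled datasets whose slope is at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(slope(x, y))
    shuffled = list(y)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffling y breaks any real link between x and y (the "no effect" scenario).
        rng.shuffle(shuffled)
        if abs(slope(x, shuffled)) >= observed:
            at_least_as_extreme += 1
    # The +1 counts the observed data itself, so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Illustrative data only: y rises with x, plus some irregular noise.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

p_value = permutation_p_value(x, y)
print(p_value, p_value < 0.05)
```

If almost no shuffled dataset produces a slope as steep as the real one, the p-value is small and the variable is considered statistically significant.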

4. Which part of the regression output tells you about practical significance?

Practical significance is related to the size of the effect.

So you look at:

The coefficient itself (the effect size), read in the units of the variables

It is not enough for something to be statistically significant.

You must also ask:

“Is the effect large enough to matter in real life?”

Practical Example

Imagine a drug that:

  • increases life expectancy by 37 seconds

  • has a very low p-value

Statistically significant?

Yes.

Practically significant?

Probably not, because 37 seconds is not meaningful in everyday life.
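
The drug example can be written as two separate checks. In this sketch only the 37 seconds comes from the example; the standard error and the "minimum meaningful effect" are hypothetical numbers, and the p-value uses a normal approximation (z-test) rather than whatever test a real study would use.

```python
from statistics import NormalDist

effect_seconds = 37.0       # effect size from the example above
std_error_seconds = 5.0     # hypothetical: a very large study gives a small standard error
min_meaningful_seconds = 30 * 24 * 3600  # hypothetical threshold: one month

# Two-sided p-value from a normal approximation (z-test).
z = effect_seconds / std_error_seconds
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

statistically_significant = p_value < 0.05
practically_significant = effect_seconds >= min_meaningful_seconds

print(statistically_significant)  # True
print(practically_significant)    # False
```

The p-value answers "is the effect distinguishable from zero?", while the comparison against a domain threshold answers "is the effect big enough to care about?".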

Quick Interview Summary

Linear Regression with 1 variable

  • 2 coefficients:

    • intercept

    • slope

Logistic Regression

  • predicts categories

  • usually binary outcomes (0/1)

Statistical Significance

  • measured using the p-value

  • commonly significant if p < 0.05

Practical Significance

  • measured using the coefficient/effect size

  • larger effect = more meaningful in practice



In practice, most data analysis is estimation from limited, noisy data. Because of that, there are many ways to get things wrong if you are not aware of the assumptions behind the methods you are using.

Whether you’re working with linear regression, hypothesis testing, confidence intervals, Monte Carlo methods, or more complex approaches like ensemble models or deep neural networks, there are always underlying assumptions involved. Being aware of those assumptions—and the ways your analysis can fail—is essential to avoid making overly confident decisions based on hidden weaknesses in your model.

With very complex models, you often need to be especially cautious about overfitting. A model might perform extremely well on the data it was trained on but fail to generalize to new data, even if it looks highly accurate on paper.

On the other hand, with simpler models like linear regression, the risk is the opposite: the model may be too simple and produce conclusions that are overly generalized or that ignore important real-world complexity.

So in a sense, these are two extremes: complex models can be too flexible and overfit, while simple models can be too rigid and underfit. Understanding this trade-off—and the assumptions behind each method—is crucial for making reliable data-driven decisions.
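
The trade-off can be seen by comparing training error with error on new data. The sketch below uses synthetic data that is linear by construction, so which model counts as "too simple" here is an artifact of that assumption. A constant predictor stands in for a model that is too rigid, and a 1-nearest-neighbor "memorizer" stands in for a model that is too flexible; both are illustrative stand-ins, not models named in the text.

```python
import random

rng = random.Random(0)


def true_signal(x):
    return 2.0 * x + 1.0


# Synthetic data: a linear signal plus Gaussian noise (assumed for illustration).
train_x = [float(i) for i in range(20)]
train_y = [true_signal(x) + rng.gauss(0, 1) for x in train_x]
test_x = [x + 0.5 for x in train_x]
test_y = [true_signal(x) + rng.gauss(0, 1) for x in test_x]


def fit_constant(xs, ys):
    """Too rigid: always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """Too flexible: returns the y of the nearest training point (1-nearest-neighbor)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of the model's predictions."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


results = {}
for name, fit in [("constant", fit_constant), ("linear", fit_linear), ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    # Store (training error, test error) for each model.
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(name, results[name])
```

The memorizer reaches zero training error because it reproduces the noise, yet its error on new points is larger. The constant predictor is poor on both sets. A large gap between training and test error suggests overfitting, while high error on both suggests underfitting.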