```python
from sklearn.datasets import make_regression

# make_regression returns the feature matrix X and the target vector y
X, y = make_regression(
    n_samples=10000,
    n_features=10,
    n_informative=5,
    bias=0,
    noise=40,
    n_targets=1,
    random_state=0,
)
```
| Parameter | Description |
|---|---|
| `n_samples` | Number of data points (rows). Here, 10,000 samples are generated. |
| `n_features` | Total number of input features (columns). Here, 10 features are created. |
| `n_informative` | Number of features that actually influence the target variable. The remaining features are noise. |
| `bias` | Constant added to the output `y`. It shifts the target values up or down. |
| `noise` | Standard deviation of Gaussian noise added to the output to make the dataset more realistic. |
| `n_targets` | Number of output variables. Usually 1 for regression. |
| `random_state` | Seed for the random number generator to ensure reproducibility of results. |
# make_regression

If you’re learning machine learning or testing models, you’ll often need synthetic data. The `make_regression` function from `sklearn.datasets` is perfect for creating regression datasets with controlled complexity.
In this guide, we’ll generate a dataset with 10,000 samples and 10 features, where only 5 features are actually informative. We’ll also add some noise to make the dataset feel more “real-world.”
```python
from sklearn.datasets import make_regression

X, y = make_regression(
    n_samples=10000,
    n_features=10,
    n_informative=5,
    noise=40,
    random_state=0,
)
```
This dataset can now be used to train any regression model, like linear regression or decision trees.
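As one possible sketch of that next step (using scikit-learn’s `LinearRegression` and `train_test_split` as an example choice of model and evaluation split, not the only option), you can fit and score a model on this synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Regenerate the same dataset as above
X, y = make_regression(n_samples=10000, n_features=10,
                       n_informative=5, noise=40, random_state=0)

# Hold out a test set so the score reflects generalization, not memorization
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
score = model.score(X_test, y_test)  # R^2 on held-out data
print(f"R^2: {score:.3f}")
```

Because the data really is linear, linear regression fits it well; the `noise=40` term is what keeps the R² below a perfect 1.0.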
## What Are n_samples and n_features in make_regression?

When generating data with `make_regression`, two key parameters are:

- `n_samples`: how many data points (rows) you’ll generate. More samples generally give a model more to learn from.
- `n_features`: how many input variables (columns) each sample will have.
For example:
```python
X, y = make_regression(n_samples=10000, n_features=10)
```
You now have a dataset `X` with shape `(10000, 10)` and a target vector `y` with 10,000 values. It’s perfect for training machine learning models on structured tabular data.
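As a quick sanity check of those shapes:

```python
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10000, n_features=10, random_state=0)
print(X.shape)  # (10000, 10): 10,000 rows, 10 feature columns
print(y.shape)  # (10000,): one target value per sample
```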
## The Power of n_informative in Data Generation

When generating regression data, not all features need to impact the output. That’s what `n_informative` is for.
```python
X, y = make_regression(n_features=10, n_informative=5)
```
This means:

- 10 features are created.
- Only 5 are actually meaningful (i.e., they affect the output `y`).
- The remaining 5 are just noise or distractions.
This helps mimic real-world scenarios where not all data is useful.
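One way to see this directly is with `coef=True`, which makes `make_regression` also return the ground-truth coefficients it used to build `y`. In this sketch, counting the nonzero coefficients recovers the number of informative features:

```python
import numpy as np
from sklearn.datasets import make_regression

# coef=True returns the true linear coefficients behind y
X, y, coef = make_regression(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    coef=True,
    random_state=0,
)

# Only the informative features receive a nonzero weight
print(np.count_nonzero(coef))  # 5
```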
## Adding Realism to Your Synthetic Data with bias and noise

- `bias`: adds a constant to the output.
- `noise`: adds Gaussian noise to simulate measurement error or natural randomness.
Example:
```python
X, y = make_regression(bias=0, noise=40)
```
This adds zero-mean Gaussian noise with a standard deviation of 40, making your dataset less “perfect” and more realistic. That’s great for model testing.
## Why Setting random_state Makes Your Machine Learning Results Reproducible

In any function that involves randomness, you can set `random_state` to make results consistent across runs.
```python
X, y = make_regression(random_state=0)
```
This ensures that every time you run the code, you get the same dataset. It’s essential for debugging, sharing code, and reproducible research.
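As a quick check, two calls with the same seed yield identical arrays:

```python
import numpy as np
from sklearn.datasets import make_regression

X1, y1 = make_regression(n_samples=100, n_features=10, random_state=0)
X2, y2 = make_regression(n_samples=100, n_features=10, random_state=0)

# Same random_state, so both calls reproduce the exact same data
print(np.array_equal(X1, X2) and np.array_equal(y1, y2))  # True
```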