Build a Custom Scikit-Learn Regression Model: Step-by-Step Guide

Creating custom regressors in scikit-learn means building your own machine learning models that follow scikit-learn’s API conventions, allowing them to work seamlessly with pipelines, grid search, and all other scikit-learn tools.

Ever hit a wall where existing scikit-learn regressors just don’t fit your specific problem? Maybe you need a model that minimizes a custom loss function, or you want to implement a research paper’s algorithm. Building your own regressor lets you solve unique problems while keeping all the power of scikit-learn’s ecosystem.

This tutorial will walk you through creating three custom regressors from scratch. You’ll learn the core concepts and build practical examples you can adapt for your own needs.

1. Why Build Custom Regressors?

Standard scikit-learn regressors work great for most problems. But sometimes you need something specific.

Linear regression minimizes squared error. What if you care more about absolute error?

Random forests use decision trees. What if you want to use a completely different approach?

Existing models optimize common metrics. What if your business has a unique cost function?

Building custom regressors lets you:
– Implement research papers and novel algorithms
– Create domain-specific models for your industry
– Experiment with custom loss functions
– Combine multiple algorithms in new ways

We will build our regressors so they work with everything in scikit-learn: pipelines, cross-validation, grid search, and model selection.

[Flow diagram: the steps for building a custom scikit-learn regression model]

2. The Core Components: BaseEstimator and RegressorMixin

Every scikit-learn estimator inherits from BaseEstimator. Every regressor also inherits from RegressorMixin. These give you the standard methods like get_params(), set_params(), and score() for free.

The scikit-learn API design follows key principles: consistency, composition, and sensible defaults. This consistent interface is what makes custom estimators so powerful.

Let’s start with the absolute basics:

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

# Our first custom regressor
class BasicRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, demo_param=1.0):
        self.demo_param = demo_param

    def fit(self, X, y):
        # Validate inputs
        X, y = check_X_y(X, y)

        # Store training info
        self.n_features_in_ = X.shape[1]
        self.is_fitted_ = True

        return self

    def predict(self, X):
        # Check if fitted
        check_is_fitted(self)

        # Validate input
        X = check_array(X)

        # Make predictions (dummy for now)
        return np.ones(X.shape[0]) * self.demo_param

print("Basic regressor created!")

That’s the skeleton. Every custom regressor needs:
– __init__() to set parameters
– fit() to train the model
– predict() to make predictions

The validation functions (check_X_y, check_array, check_is_fitted) handle edge cases and error checking.
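
Because BasicRegressor inherits from BaseEstimator and RegressorMixin, it already has get_params(), set_params(), and an R² score() method that we never wrote. Here is a quick sanity check (a minimal sketch; X_demo and y_demo are made-up toy arrays):

# Inherited for free: get_params()/set_params() from BaseEstimator,
# score() (R^2) from RegressorMixin
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0.0, 1.0, 2.0, 3.0])

reg = BasicRegressor(demo_param=1.5)
print(reg.get_params())            # {'demo_param': 1.5}
reg.set_params(demo_param=2.0)     # returns the estimator, so calls can chain
reg.fit(X_demo, y_demo)
print(reg.score(X_demo, y_demo))   # R^2 of the constant prediction (negative here)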

3. Building a Simple Mean Regressor

Let’s build something useful. A mean regressor predicts the training set average for every input. Simple but surprisingly effective as a baseline.

class MeanRegressor(BaseEstimator, RegressorMixin):
    def __init__(self):
        pass

    def fit(self, X, y):
        # Validate inputs
        X, y = check_X_y(X, y)

        # Learn the mean
        self.mean_ = np.mean(y)

        # Store metadata
        self.n_features_in_ = X.shape[1]
        self.is_fitted_ = True

        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)

        # Return mean for all predictions
        return np.full(X.shape[0], self.mean_)

# Test it out
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)
print(f"Sample data shapes: X={X.shape}, y={y.shape}")

mean_reg = MeanRegressor()
mean_reg.fit(X, y)
predictions = mean_reg.predict(X[:5])
print(f"Mean regressor predictions: {predictions}")
print(f"Actual mean: {np.mean(y):.4f}")

This regressor ignores the features completely and just predicts the mean. But it follows the scikit-learn API perfectly.
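
If you want a sanity check against scikit-learn itself, DummyRegressor with strategy='mean' does the same thing. A quick comparison, reusing the X, y, and mean_reg from above:

from sklearn.dummy import DummyRegressor

dummy = DummyRegressor(strategy="mean").fit(X, y)
print(np.allclose(mean_reg.predict(X), dummy.predict(X)))  # True
print(f"R^2 of the mean baseline on training data: {mean_reg.score(X, y):.4f}")  # 0 up to rounding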

4. Creating a Least Absolute Deviations (LAD) Regressor

Now let’s build something more sophisticated. Linear regression minimizes squared error, but what if you want to minimize absolute error instead? That’s called Least Absolute Deviations (LAD) regression.

LAD regression is useful because absolute error doesn’t punish outliers as heavily as squared error does, making the fit more robust to extreme values.

from scipy.optimize import minimize

class LADRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept

    def fit(self, X, y):
        X, y = check_X_y(X, y)

        # Add intercept column if needed
        if self.fit_intercept:
            X_with_intercept = np.column_stack([np.ones(X.shape[0]), X])
        else:
            X_with_intercept = X

        # Define loss function (mean absolute error)
        def loss_function(coefficients):
            predictions = X_with_intercept @ coefficients
            return np.mean(np.abs(y - predictions))

        # Initialize coefficients
        initial_coefs = np.zeros(X_with_intercept.shape[1])

        # Optimize
        result = minimize(loss_function, initial_coefs, method='BFGS')

        # Store results
        if self.fit_intercept:
            self.intercept_ = result.x[0]
            self.coef_ = result.x[1:]
        else:
            self.intercept_ = 0
            self.coef_ = result.x

        self.n_features_in_ = X.shape[1]
        self.is_fitted_ = True

        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)

        return X @ self.coef_ + self.intercept_

# Test LAD regressor
lad_reg = LADRegressor()
lad_reg.fit(X, y)
lad_predictions = lad_reg.predict(X[:5])
print(f"LAD regressor predictions: {lad_predictions}")
print(f"LAD coefficients: {lad_reg.coef_[:3]}...")  # Show first 3

This regressor minimizes mean absolute error instead of mean squared error. The scipy.optimize.minimize function does the heavy lifting.
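
To see the robustness claim in action, here is a small experiment: corrupt a handful of targets with large shocks and compare how far the OLS and LAD coefficients drift from a fit on the clean data. The names y_outliers and idx are made up for this illustration, and the exact numbers will vary from run to run:

from sklearn.linear_model import LinearRegression

# Corrupt 5 targets with large positive shocks to simulate outliers
rng = np.random.RandomState(0)
y_outliers = y.copy()
idx = rng.choice(len(y), size=5, replace=False)
y_outliers[idx] += 500

clean_fit = LinearRegression().fit(X, y)
ols_fit = LinearRegression().fit(X, y_outliers)
lad_fit = LADRegressor().fit(X, y_outliers)

# LAD should drift far less than OLS from the clean-data coefficients
print("OLS coefficient drift:", np.abs(ols_fit.coef_ - clean_fit.coef_).mean())
print("LAD coefficient drift:", np.abs(lad_fit.coef_ - clean_fit.coef_).mean())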

5. Advanced Example: Weighted Regression

Let’s build a regressor that accepts sample weights. This is useful when some training examples are more important than others.

class WeightedRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Regularization parameter

    def fit(self, X, y, sample_weight=None):
        X, y = check_X_y(X, y)

        if sample_weight is None:
            sample_weight = np.ones(X.shape[0])
        else:
            sample_weight = np.array(sample_weight)

        # Add intercept
        X_with_intercept = np.column_stack([np.ones(X.shape[0]), X])

        # Weighted least squares with regularization
        W = np.diag(sample_weight)
        regularization = self.alpha * np.eye(X_with_intercept.shape[1])
        regularization[0, 0] = 0  # Don't regularize intercept

        # Solve: (X'WX + αI)β = X'Wy
        XtWX = X_with_intercept.T @ W @ X_with_intercept
        XtWy = X_with_intercept.T @ W @ y

        coefficients = np.linalg.solve(XtWX + regularization, XtWy)

        self.intercept_ = coefficients[0]
        self.coef_ = coefficients[1:]
        self.n_features_in_ = X.shape[1]
        self.is_fitted_ = True

        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)

        return X @ self.coef_ + self.intercept_

# Test with weights
weights = np.random.exponential(1, size=X.shape[0])  # Random weights
weighted_reg = WeightedRegressor(alpha=0.1)
weighted_reg.fit(X, y, sample_weight=weights)
weighted_predictions = weighted_reg.predict(X[:5])
print(f"Weighted regressor predictions: {weighted_predictions}")

This regressor uses weighted least squares. Samples with higher weights have more influence on the final model.
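
A useful intuition check: with no regularization, giving one sample a weight of 2 should produce the same fit as duplicating that row. A small sketch on a slice of the toy data:

X_small, y_small = X[:20], y[:20]

# Weight the first sample twice as much
w = np.ones(20)
w[0] = 2.0
reg_weighted = WeightedRegressor(alpha=0.0).fit(X_small, y_small, sample_weight=w)

# Duplicate the first row instead
X_dup = np.vstack([X_small, X_small[:1]])
y_dup = np.concatenate([y_small, y_small[:1]])
reg_duplicated = WeightedRegressor(alpha=0.0).fit(X_dup, y_dup)

print(np.allclose(reg_weighted.coef_, reg_duplicated.coef_))  # expected: True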

6. Testing Your Custom Regressors

Scikit-learn provides tools to test if your estimator follows the API correctly:

from sklearn.utils.estimator_checks import check_estimator

# Test our regressors
for regressor in [MeanRegressor(), LADRegressor(), WeightedRegressor()]:
    name = type(regressor).__name__
    try:
        check_estimator(regressor)
        print(f"✓ {name} passes all tests")
    except Exception as e:
        print(f"✗ {name} failed: {e}")

The check_estimator function runs dozens of tests to ensure your estimator behaves correctly.
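
If you use pytest, scikit-learn also provides parametrize_with_checks, which turns each individual check into its own test case. A sketch of that pattern (this would live in a test_*.py file):

from sklearn.utils.estimator_checks import parametrize_with_checks

@parametrize_with_checks([MeanRegressor(), LADRegressor(), WeightedRegressor()])
def test_sklearn_api_compliance(estimator, check):
    check(estimator)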

7. Using Your Regressors with Scikit-Learn Tools

The real power comes from integration. Your custom regressors work with all scikit-learn tools:

from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error

# Cross-validation
cv_scores = cross_val_score(LADRegressor(), X, y, cv=5, scoring='neg_mean_absolute_error')
print(f"LAD Regressor CV scores: {-cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

# Grid search for hyperparameters
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0]}
grid_search = GridSearchCV(WeightedRegressor(), param_grid, cv=3, scoring='neg_mean_squared_error')
grid_search.fit(X, y)
print(f"Best alpha for WeightedRegressor: {grid_search.best_params_['alpha']}")

# Pipeline with preprocessing
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LADRegressor())
])
pipeline.fit(X, y)
pipeline_predictions = pipeline.predict(X[:5])
print(f"Pipeline predictions: {pipeline_predictions}")

Your custom regressors integrate seamlessly with scikit-learn’s ecosystem.
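
One detail worth knowing: when a custom regressor sits inside a Pipeline, its hyperparameters are addressed with the step__parameter naming convention, because get_params() exposes them automatically. A short sketch reusing the imports above:

# Tune WeightedRegressor's alpha through the pipeline's 'regressor' step
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', WeightedRegressor())
])
pipe_search = GridSearchCV(
    pipe,
    param_grid={'regressor__alpha': [0.01, 0.1, 1.0]},
    cv=3,
    scoring='neg_mean_squared_error'
)
pipe_search.fit(X, y)
print(f"Best pipeline params: {pipe_search.best_params_}")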

8. Real-World Application: Custom Loss Functions

Here’s a practical example. Imagine you’re predicting house prices, but overestimating is much worse than underestimating (maybe you’re setting budgets). You need asymmetric loss.

class AsymmetricRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, overestimate_penalty=10.0, underestimate_penalty=1.0):
        self.overestimate_penalty = overestimate_penalty
        self.underestimate_penalty = underestimate_penalty

    def fit(self, X, y):
        X, y = check_X_y(X, y)

        X_with_intercept = np.column_stack([np.ones(X.shape[0]), X])

        def asymmetric_loss(coefficients):
            predictions = X_with_intercept @ coefficients
            errors = y - predictions

            # Penalty based on direction of error
            penalties = np.where(
                errors > 0,  # underestimate (actual > predicted)
                self.underestimate_penalty * np.abs(errors),
                self.overestimate_penalty * np.abs(errors)  # overestimate
            )
            return np.mean(penalties)

        # Optimize
        initial_coefs = np.zeros(X_with_intercept.shape[1])
        result = minimize(asymmetric_loss, initial_coefs, method='BFGS')

        self.intercept_ = result.x[0]
        self.coef_ = result.x[1:]
        self.n_features_in_ = X.shape[1]
        self.is_fitted_ = True

        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)

        return X @ self.coef_ + self.intercept_

# Test asymmetric regressor
asymmetric_reg = AsymmetricRegressor(overestimate_penalty=5.0, underestimate_penalty=1.0)
asymmetric_reg.fit(X, y)
asymmetric_predictions = asymmetric_reg.predict(X[:5])
print(f"Asymmetric regressor predictions: {asymmetric_predictions}")

# Compare with regular linear regression
from sklearn.linear_model import LinearRegression
linear_reg = LinearRegression()
linear_reg.fit(X, y)
linear_predictions = linear_reg.predict(X[:5])
print(f"Linear regression predictions: {linear_predictions}")

This regressor penalizes overestimation more heavily than underestimation, making it suitable for budget-sensitive applications.
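
A quick way to verify the behavior is to look at the mean signed error on the training data: with a heavier penalty on overestimates, the asymmetric fit should lean toward under-predicting, while ordinary least squares should be roughly unbiased. A hedged check (exact numbers depend on the data):

asym_errors = y - asymmetric_reg.predict(X)
ols_errors = y - linear_reg.predict(X)
print(f"Mean signed error (asymmetric): {asym_errors.mean():.4f}")  # expected to be positive
print(f"Mean signed error (OLS):        {ols_errors.mean():.4f}")  # expected to be near zero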

9. Advanced Features and Best Practices

When building production-ready regressors, consider these advanced features:

class ProductionRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, learning_rate=0.01, max_iter=1000, tol=1e-6, random_state=None):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.tol = tol
        self.random_state = random_state

    def fit(self, X, y):
        X, y = check_X_y(X, y)

        # Use a local RandomState for reproducibility (avoids touching the global seed)
        rng = np.random.RandomState(self.random_state)

        # Initialize parameters
        n_features = X.shape[1]
        self.coef_ = rng.normal(0, 0.01, n_features)
        self.intercept_ = 0.0

        # Add intercept to X
        X_with_intercept = np.column_stack([np.ones(X.shape[0]), X])
        coefficients = np.concatenate([[self.intercept_], self.coef_])

        # Gradient descent
        prev_loss = float('inf')
        for iteration in range(self.max_iter):
            # Forward pass
            predictions = X_with_intercept @ coefficients
            loss = np.mean((y - predictions) ** 2)

            # Check convergence
            if abs(prev_loss - loss) < self.tol:
                print(f"Converged after {iteration} iterations")
                break

            # Backward pass (compute gradients)
            residuals = predictions - y
            gradients = (X_with_intercept.T @ residuals) / X.shape[0]

            # Update parameters
            coefficients -= self.learning_rate * gradients
            prev_loss = loss

        # Store final parameters
        self.intercept_ = coefficients[0]
        self.coef_ = coefficients[1:]
        self.n_features_in_ = X.shape[1]
        self.is_fitted_ = True

        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)

        return X @ self.coef_ + self.intercept_

# Test production regressor
prod_reg = ProductionRegressor(learning_rate=0.01, max_iter=500, random_state=42)
prod_reg.fit(X, y)
prod_predictions = prod_reg.predict(X[:5])
print(f"Production regressor predictions: {prod_predictions}")

This example shows proper parameter handling, convergence checking, and random state management.
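
One practical caveat: gradient descent is sensitive to feature scale, so in practice you would usually standardize the inputs first. A minimal sketch, reusing Pipeline and StandardScaler from earlier:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

scaled_prod = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', ProductionRegressor(learning_rate=0.01, max_iter=500, random_state=42))
])
scaled_prod.fit(X, y)
print(f"Scaled pipeline predictions: {scaled_prod.predict(X[:5])}")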

10. Performance Comparison

Let’s compare all our custom regressors:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Create more substantial test data
X_large, y_large = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_large, y_large, test_size=0.2, random_state=42)

# Test all regressors
regressors = {
    'Mean': MeanRegressor(),
    'LAD': LADRegressor(),
    'Weighted': WeightedRegressor(alpha=0.1),
    'Asymmetric': AsymmetricRegressor(overestimate_penalty=3.0),
    'Production': ProductionRegressor(learning_rate=0.001, max_iter=1000, random_state=42),
    'LinearRegression': LinearRegression()  # Baseline
}

results = {}
for name, regressor in regressors.items():
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)

    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)

    results[name] = {'MSE': mse, 'MAE': mae}
    print(f"{name:12} - MSE: {mse:.4f}, MAE: {mae:.4f}")

print("\nBest MSE:", min(results.items(), key=lambda x: x[1]['MSE']))
print("Best MAE:", min(results.items(), key=lambda x: x[1]['MAE']))

This comparison helps you understand when to use each regressor type.

11. Common Pitfalls and How to Avoid Them

When building custom regressors, watch out for these issues:

# WRONG: Don't store training data in __init__
class BadRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, X, y):  # DON'T DO THIS
        self.X = X
        self.y = y

# RIGHT: Store parameters only
class GoodRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, param=1.0):
        self.param = param

# WRONG: Don't modify input parameters
class BadRegressor2(BaseEstimator, RegressorMixin):
    def __init__(self, param_list=[]):  # Mutable default!
        self.param_list = param_list
        self.param_list.append(1)  # Modifying input!

# RIGHT: Use None as the default and defer any copying to fit()
class GoodRegressor2(BaseEstimator, RegressorMixin):
    def __init__(self, param_list=None):
        self.param_list = param_list  # store exactly as given, no logic here

    def fit(self, X, y):
        # Resolve the default and copy the mutable value during fitting instead
        self.param_list_ = list(self.param_list) if self.param_list is not None else []
        return self

# WRONG: Don't forget to return self from fit
class BadRegressor3(BaseEstimator, RegressorMixin):
    def fit(self, X, y):
        # ... fitting logic ...
        pass  # Missing return self!

# RIGHT: Always return self
class GoodRegressor3(BaseEstimator, RegressorMixin):
    def fit(self, X, y):
        # ... fitting logic ...
        return self

print("Common pitfalls demonstrated!")

Following these patterns ensures your regressor works properly with scikit-learn.
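
One reason these rules matter: scikit-learn clones estimators during cross-validation and grid search by reading the __init__ parameters back through get_params(). If __init__ stores anything other than its parameters, cloning breaks. A quick illustration with the GoodRegressor defined above:

from sklearn.base import clone

original = GoodRegressor(param=2.0)
cloned = clone(original)        # rebuilt as GoodRegressor(param=2.0), unfitted
print(cloned.get_params())      # {'param': 2.0}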

12. Extending to Classification and Transformation

The same principles apply to classifiers and transformers:

from sklearn.base import ClassifierMixin, TransformerMixin

# Custom classifier template
class CustomClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, param=1.0):
        self.param = param

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        self.classes_ = np.unique(y)  # Required for classifiers
        self.n_features_in_ = X.shape[1]
        self.is_fitted_ = True
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)
        # Return most common class (dummy implementation)
        return np.full(X.shape[0], self.classes_[0])

# Custom transformer template
class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, scale_factor=1.0):
        self.scale_factor = scale_factor

    def fit(self, X, y=None):
        X = check_array(X)
        self.n_features_in_ = X.shape[1]
        self.is_fitted_ = True
        return self

    def transform(self, X):
        check_is_fitted(self)
        X = check_array(X)
        return X * self.scale_factor

print("Classifier and transformer templates ready!")

The patterns are nearly identical across all estimator types.
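
As a final sketch, the custom transformer and regressor drop straight into a pipeline together (reusing the toy X, y from earlier):

from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([
    ('scale_features', CustomTransformer(scale_factor=2.0)),
    ('regressor', MeanRegressor())
])
demo_pipeline.fit(X, y)
print(f"Pipeline with custom transformer: {demo_pipeline.predict(X[:3])}")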

The power of custom regressors isn’t just in solving unique problems – it’s in solving them while keeping all the tools and workflows you already know and love.

Reference: The original concept of scikit-learn’s API design is detailed in the seminal paper “API design for machine learning software: experiences from the scikit-learn project” by Buitinck et al. (2013), which established the principles that make custom estimators so powerful.
