Max Halford

Locally weighted bagging

January 25, 2018 | 29 minutes read

Last October marked the start of my PhD, which means that I haven’t had a lot of time to work on my side projects. However, I have been fiddling with koza, a tool for performing symbolic regression. It’s going quite well so I decided to implement bagging to see if it improved the predictive power. Needless to say I went through a few hoops to get all this working nicely – koza is written in Go, so I couldn’t build on the shoulders of others because Go isn’t very popular in the machine learning community. As it turns out, symbolic regression is a very volatile method and bagging didn’t work as well as I expected. The problem seems to be that a single very bad predictor deteriorates the overall performance if it is included in the ensemble. I have an idea, which I am calling locally weighted bagging, to partially solve this issue. However, I first want to give an overview of bagging.

Vanilla bagging

In machine learning, the term bagging is a portmanteau word short for bootstrap aggregation. In short, bagging is a meta-model that combines the predictions of individual models in order to improve the stability and accuracy of the final prediction. It works by repeatedly sampling the training set and training a model on each sample. Each model is thus trained on a different sample of the original dataset. At prediction time each model makes a prediction and the predictions are averaged. In so-called vanilla mode the sampling is done with replacement and the arithmetic mean is used to average the individual predictions. At a higher level, bagging is part of a family of methods known as ensemble methods, where the general goal is to combine models. The rule of thumb is that ensemble methods work rather well if the models are good and have uncorrelated predictions. Indeed, there is no point in mixing good models that all say the same thing. This applies to domains other than machine learning; for example many detectives with different backgrounds and experiences will usually crack a case faster than a maverick – even if he’s really good.
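
As a quick aside, the variance reduction that motivates bagging is easy to simulate. The following snippet – a toy example of mine, not part of the method itself – averages 30 unbiased predictors with uncorrelated errors and shows that the variance of the average is roughly 30 times smaller than that of a single predictor.

import numpy as np

rng = np.random.RandomState(0)
truth = 3.0

# 30 unbiased predictors whose errors are uncorrelated
preds = truth + rng.normal(0, 1, size=(30, 1000))

print(np.var(preds[0]))            # variance of a single predictor, close to 1
print(np.var(preds.mean(axis=0)))  # variance of the average, close to 1/30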

Overall, bagging is not very complicated; however, I’m still going to implement it from scratch so that the final method I present doesn’t come out of the blue. I decided to make my code follow the scikit-learn idiom and implement their fit and predict API for obvious compatibility reasons. The main benefit of doing so is that it simplifies the process of comparing my method with other ones.

import numpy as np
from sklearn.base import clone
from sklearn.base import RegressorMixin
from sklearn.utils import check_random_state
from sklearn.ensemble.base import BaseEnsemble


class BaggingRegressor(BaseEnsemble, RegressorMixin):

    def __init__(self,
                 base_estimator=None,
                 n_estimators=10,
                 random_state=None):
        super().__init__(base_estimator=base_estimator, n_estimators=n_estimators)

        self.random_state = random_state

    def fit(self, X, y):

        random_state = check_random_state(self.random_state)

        n, p = X.shape
        self.estimators_ = []
        self.estimators_features_ = []

        for i in range(self.n_estimators):

            # Sample observations and features
            rows = random_state.randint(0, n, n)
            cols = random_state.randint(0, p, p)
            X_fit = (X[rows])[:, cols]
            y_fit = y[rows]

            # Clone the estimator
            estimator = clone(self.base_estimator)

            # Fit the estimator
            estimator.fit(X_fit, y_fit)

            # Store the estimator and the associated feature indexes
            self.estimators_.append(estimator)
            self.estimators_features_.append(cols)

        return self

    def predict(self, X):

        # Get each estimator's predictions
        y_pred_all = np.array([
            estimator.predict(X[:, cols])
            for estimator, cols in zip(self.estimators_, self.estimators_features_)
        ])

        # Average the predictions
        y_pred = np.average(y_pred_all, axis=0)

        return y_pred

Let’s go through the code step by step. Like most ensemble methods, bagging takes as parameters a base estimator – an estimator is a machine learning model in sklearn vocabulary – and the number of these estimators to use in the ensemble. I’m inheriting from BaseEnsemble to tell sklearn my model is an ensemble and from RegressorMixin to tell it my model does regression – of course bagging also works for classification. I’m also adding a random_state parameter to my model to control its randomness and ensure the results I get are reproducible. I don’t want to delve too much into how to implement a model compatible with sklearn’s API; it’s quite well documented on their website.

Now for the fit method. The idea is that for each model we want to include, we have to sample the training set and train the model on the sample. The trick is that if we are sampling the columns, we have to keep track of which columns were used by which model so that at prediction time we can feed the right columns to each model. You do not necessarily have to sample the columns as well as the rows; it’s just a different flavour of bagging. The sampled column indexes for each estimator are stored in estimators_features_. As for the predict method, it is rather straightforward. We simply collect each estimator’s predictions into an array called y_pred_all and compute the arithmetic mean across the estimators. Of course, because each estimator used a different set of columns, we select the right ones inside our loop using X[:, cols].
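
If the double indexing looks mysterious, here is a tiny standalone example – the numbers are made up – showing how a bootstrap sample of rows and a sample of columns are chained together, just like in the fit method.

import numpy as np

X = np.arange(12).reshape(4, 3)

rows = np.array([0, 0, 2, 3])  # a bootstrap sample of row indexes (with replacement)
cols = np.array([2, 0])        # a sample of column indexes

# First select the rows, then the columns
print((X[rows])[:, cols])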

Next we can compare our implementation to other methods made available in sklearn. I’m going to use the Boston house prices dataset provided by sklearn. I’m also going to perform repeated cross-validation – 10 repeats of 10-fold splits – over each model to assess its average performance. The comparison metric will be the mean squared error. Finally, the randomness is controlled by the RANDOM_STATE variable, which guarantees I will get the same results if I run my code again. Because bagging is a meta-model, I could feed it any base model and it would work. I’ve chosen to use a decision tree as a base model simply because this results in an algorithm popularly known as random forest – yes, a random forest is essentially bagging on top of decision trees. I can thus compare my implementation to the random forest implementation available in sklearn. To make the comparison fair, each ensemble model uses 30 estimators.

from sklearn import datasets
from sklearn import ensemble
from sklearn import model_selection
from sklearn import tree


RANDOM_STATE = 42

X, y = datasets.load_boston(return_X_y=True)

n_estimators = 30

models = {
    'Decision tree': tree.DecisionTreeRegressor(random_state=RANDOM_STATE),
    'Vanilla decision tree bagging': BaggingRegressor(
        base_estimator=tree.DecisionTreeRegressor(random_state=RANDOM_STATE),
        n_estimators=n_estimators,
        random_state=RANDOM_STATE
    ),
    'Random forest': ensemble.RandomForestRegressor(
        n_estimators=n_estimators,
        random_state=RANDOM_STATE
    ),
}

cv = model_selection.RepeatedKFold(n_repeats=10, n_splits=10, random_state=RANDOM_STATE)

for name, estimator in models.items():
    scores = model_selection.cross_val_score(estimator, X, y, cv=cv, scoring='neg_mean_squared_error')
    print('MSE: {:.3f} (± {:.3f}) [{}]'.format(-np.mean(scores), np.std(scores), name))

I get the following results running this code on my machine.

>>> MSE: 21.189 (± 11.583) [Decision tree]
>>> MSE: 12.495 (± 5.670) [Vanilla decision tree bagging]
>>> MSE: 11.121 (± 4.688) [Random forest]

As expected, bagging does better than a single decision tree. Although the bagging isn’t doing as well as the random forest, it’s a good effort given the small amount of code it took to implement. sklearn’s random forest most probably has some optimisations that give it an edge. Moreover, the default parameters used by DecisionTreeRegressor might not be the same ones used by RandomForestRegressor – I’m too lazy to take care of that for the moment. However, the difference is not significant, so it’s difficult to come to any conclusions.
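
If you’re curious, here is one way to eyeball that difference in defaults – a quick check I’m sketching here, not something from the benchmark above.

from sklearn import ensemble
from sklearn import tree

tree_params = tree.DecisionTreeRegressor().get_params()
forest_params = ensemble.RandomForestRegressor().get_params()

# Print the hyperparameters shared by both models that have different defaults
for param in sorted(set(tree_params) & set(forest_params)):
    if tree_params[param] != forest_params[param]:
        print(param, tree_params[param], forest_params[param])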

Weighted bagging

Anyway, bagging is nice, but it is sensitive to the performance of each individual model. Indeed, in vanilla mode each model’s contribution to the final prediction is the same; in other words, a bad model contributes just as much as a good one. In practice this isn’t a problem because most models aren’t very volatile. However, this hasn’t been the case in my experience with genetic programming. It would be handy if we could weight each model’s contribution according to its assessed performance. Mathematically, vanilla bagging corresponds to the following equation:

$$\hat{y}_i = \frac{1}{M} \sum_{m=1}^{M} f_m(x_i)$$

where $\hat{y}_i$ is the final prediction, $x_i$ is the $i$th set of features, $M$ is the number of models, $f_m$ is the $m$th model, and $f_m(x_i)$ is the $m$th model's prediction. The previous equation is nothing more than a simple arithmetic mean. What we want to do is use a weighted mean to modify the contribution of each model:

$$\hat{y}_i = \frac{\sum_{m=1}^{M} w_m f_m(x_i)}{\sum_{m=1}^{M} w_m}$$

where $w_m$ is the weight used for model $f_m$. The larger $w_m$ is, the larger the contribution of the associated model. $w_m$ could be anything, but the common way to go is to use the out-of-bag error – often shortened to OOB. In the context of bagging, the OOB score is the performance of a model on the data it has not been trained on. This is quite easy to implement; inside our training loop we simply have to make predictions for the rows that are not in the sample and calculate the performance. On a side note, sampling with replacement results in a sample that contains approximately $1 - e^{-1} \approx 63\%$ of the unique rows, so the OOB score will be obtained by predicting on around 37% of the rows – which is a healthy percentage. At prediction time the weights will be used to calculate the weighted average of the individual predictions.
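
If you don’t trust the 63% figure, here is a quick simulation – just a sanity check, not part of the implementation.

import numpy as np

rng = np.random.RandomState(42)
n = 100000

# Bootstrap sample: n draws with replacement from n rows
sample = rng.randint(0, n, n)

print(len(np.unique(sample)) / n)  # close to 0.632, leaving ~37% of rows out-of-bag

With that out of the way, here is the weighted version of the bagging regressor.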

import numpy as np
from sklearn import metrics
from sklearn.base import clone
from sklearn.base import RegressorMixin
from sklearn.utils import check_random_state
from sklearn.ensemble.base import BaseEnsemble


class WeightedBaggingRegressor(BaseEnsemble, RegressorMixin):

    def __init__(self,
                 base_estimator=None,
                 n_estimators=10,
                 random_state=None,
                 oob_metric=metrics.mean_squared_error):

        super().__init__(base_estimator=base_estimator, n_estimators=n_estimators)

        self.random_state = random_state
        self.oob_metric = oob_metric

    def fit(self, X, y):

        random_state = check_random_state(self.random_state)

        n, p = X.shape
        self.estimators_ = []
        self.estimators_features_ = []
        self.oob_scores_ = []

        for i in range(self.n_estimators):

            # Sample observations and features
            rows = random_state.randint(0, n, n)
            cols = random_state.randint(0, p, p)
            # The out-of-bag rows are the ones that were not sampled
            oob_rows = np.setdiff1d(np.arange(n), rows)
            X_fit = (X[rows])[:, cols]
            y_fit = y[rows]
            X_val = (X[oob_rows])[:, cols]
            y_val = y[oob_rows]

            # Clone the estimator
            estimator = clone(self.base_estimator)

            # Fit the estimator
            estimator.fit(X_fit, y_fit)

            # Store the estimator and the associated feature indexes
            self.estimators_.append(estimator)
            self.estimators_features_.append(cols)

            # Evaluate and store the OOB score
            oob_score = self.oob_metric(y_val, estimator.predict(X_val))
            self.oob_scores_.append(oob_score)

        return self

    def predict(self, X):

        # Get each estimator's predictions
        y_pred_all = np.array([
            estimator.predict(X[:, cols])
            for estimator, cols in zip(self.estimators_, self.estimators_features_)
        ])

        # Average the predictions
        weights = 1 / np.array(self.oob_scores_)
        y_pred = np.average(y_pred_all, axis=0, weights=weights)

        return y_pred

Unsurprisingly, WeightedBaggingRegressor is very much like BaggingRegressor. It takes an additional parameter called oob_metric, which is a function that calculates an error between the truth and a prediction. Inside the loop of the fit method, the rows that have not been sampled are kept track of in the form of oob_rows, which is simply an array of indexes. After the estimator is trained on the sample, the OOB score is obtained and stored in oob_scores_. In the predict method, the predictions are weighted by the inverse of each model’s OOB score. The reason why I’m taking the inverse of the OOB score is that in our case the lower it is, the better the model is, and thus the larger its contribution should be. We could also have emphasised each model’s contribution and used weights = np.exp(1 / np.array(self.oob_scores_)) so that good models contribute even more. However, let’s not get ahead of ourselves; in a robust implementation the scale of the weights could be a hyperparameter.
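
To get a feel for what the inverse weighting does, here is a toy example – with made-up numbers – of three models whose OOB scores are 2, 4 and 8.

import numpy as np

oob_scores = np.array([2.0, 4.0, 8.0])  # lower is better
preds = np.array([10.0, 11.0, 20.0])    # one prediction per model

weights = 1 / oob_scores

print(np.average(preds, weights=weights))  # ~11.7, pulled towards the best model
print(np.mean(preds))                      # ~13.7, the vanilla average

Let’s see how this weighted version of bagging fares experimentally.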

from sklearn import datasets
from sklearn import ensemble
from sklearn import model_selection
from sklearn import tree


RANDOM_STATE = 42

X, y = datasets.load_boston(return_X_y=True)

n_estimators = 30

models = {
    'Decision tree': tree.DecisionTreeRegressor(random_state=RANDOM_STATE),
    'Vanilla decision tree bagging': BaggingRegressor(
        base_estimator=tree.DecisionTreeRegressor(random_state=RANDOM_STATE),
        n_estimators=n_estimators,
        random_state=RANDOM_STATE
    ),
    'Weighted decision tree bagging': WeightedBaggingRegressor(
        base_estimator=tree.DecisionTreeRegressor(random_state=RANDOM_STATE),
        n_estimators=n_estimators,
        random_state=RANDOM_STATE
    ),
    'Random forest': ensemble.RandomForestRegressor(
        n_estimators=n_estimators,
        random_state=RANDOM_STATE
    ),
}

cv = model_selection.RepeatedKFold(n_repeats=10, n_splits=10, random_state=RANDOM_STATE)

for name, estimator in models.items():
    scores = model_selection.cross_val_score(estimator, X, y, cv=cv, scoring='neg_mean_squared_error')
    print('MSE: {:.3f} (± {:.3f}) [{}]'.format(-np.mean(scores), np.std(scores), name))

This gives me the following output:

>>> MSE: 21.189 (± 11.583) [Decision tree]
>>> MSE: 12.495 (± 5.670) [Vanilla decision tree bagging]
>>> MSE: 12.341 (± 5.654) [Weighted decision tree bagging]
>>> MSE: 11.121 (± 4.688) [Random forest]

The weighted bagging did a teeny bit better; however, the gap with vanilla bagging is really slim. To spare you the time of checking: the decision tree and vanilla bagging results are identical to the previous ones. Of course it isn’t surprising that the weighted version isn’t outstandingly better than the vanilla version; it’s most certainly because the models are generally equally good. However, using a weighted average seems to me to be a good idea, mostly because the OOB score reflects a model’s performance on unseen data.

Locally weighted bagging

This is all jolly and good, but something in my gut tells me there is more to be done. The problem is that I keep thinking about the following kind of scenario: what if a model is only good at predicting certain parts of the data? In some cases it does really well and in others not so much. Machine learning models used in practice are supposed to be quite robust, and maybe I am a bit paranoid. But surely this way of thinking is as justifiable as thinking that models should contribute according to their overall performance. The following chart illustrates the kind of scenario I want my bagging procedure to avoid.

import matplotlib.pyplot as plt
import numpy as np

plt.style.use('fivethirtyeight')

np.random.seed(42)

x = np.linspace(-10, 10, 300)
y = np.abs(1 / (1 + np.exp(-x)) - 0.5) + (np.random.rand(len(x)) - 0.5) * 0.1
m1 = 1 / (1 + np.exp(-x)) - 0.5
m2 = -m1
bagging = (m1 + m2) / 2

with plt.xkcd():
    fig, ax = plt.subplots(figsize=(16, 10))
    ax.plot(x, y, 'ro', label='Data')
    ax.plot(x, m1, c='darkblue', linewidth=3, label='Model 1')
    ax.plot(x, m2, c='darkgreen', linewidth=3, label='Model 2')
    ax.plot(x, bagging, c='orange', linewidth=3, label='Bagging')
    ax.tick_params(axis='both', which='major', labelsize=18)
    ax.set_xlabel('x (feature)')
    ax.set_ylabel('y (target)')
    ax.legend(fontsize=20)

[Figure: the noisy data in red, the two mirrored sigmoid models in blue and green, and their bagged average in orange, which flattens out around zero]

We have some data points in red and we want to build a model that learns the underlying distribution. The blue model is good when $x$ is positive but it is very wrong when $x$ is negative. The green model exhibits exactly the opposite behaviour. The problem is that bagging both models results in a sub-par model. Weighting doesn’t help because both models have the same overall performance. In an ideal world we would like our bagging procedure to pick up that the blue model does well when $x$ is positive whilst the green model does well when $x$ is negative. I realise that the example I have given is extreme and doesn’t happen in practice; however, preventing this kind of behaviour could be helpful.

How should the bagging procedure choose which model to use? Well, I think that using OOB scores to weight each model’s contribution is the right way to go. My proposal is that instead of using an overall OOB score, we could use local weights that translate how well a model does on a specific region of the dataset. Mathematically, I want to do the following:

$$\hat{y}_i = \frac{\sum_{m=1}^{M} w_{i,m} f_m(x_i)}{\sum_{m=1}^{M} w_{i,m}}$$

The only thing that has changed is that the weights are now specific to each feature set (the $x_i$) as well as to each model (the $f_m$), instead of just being attached to each model. I want $w_{i,m}$ to translate how well model $f_m$ does on the feature set $x_i$. The problem is that we don’t know a priori how well the model does, because the underlying truth isn’t known at prediction time. So how do we determine $w_{i,m}$ in the meantime? Well, the thing is that we do have a training set with known labels. We could look for rows inside the training set that are close to $x_i$ and use them to calculate $w_{i,m}$. In other words, $w_{i,m}$ would translate how well a model does in the neighbourhood surrounding $x_i$.

My proposal thus takes place at prediction time. The weights are obtained by first finding the $k$ points with known labels that surround $x_i$. We can then set $w_{i,m}$ according to the average performance of model $f_m$ on the points surrounding $x_i$. Talk is cheap, let’s get to the code.

import numpy as np
from sklearn import metrics
from sklearn import neighbors
from sklearn.base import clone
from sklearn.base import RegressorMixin
from sklearn.ensemble.base import BaseEnsemble
from sklearn.utils import check_random_state


class LocallyWeightedBaggingRegressor(BaseEnsemble, RegressorMixin):

    def __init__(self,
                 base_estimator=None,
                 n_estimators=10,
                 random_state=None,
                 local_metric=metrics.mean_squared_error,
                 neighbors=neighbors.NearestNeighbors(n_neighbors=50, algorithm='brute')):

        super().__init__(base_estimator=base_estimator, n_estimators=n_estimators)

        self.random_state = random_state
        self.local_metric = local_metric
        self.neighbors = neighbors

    def fit(self, X, y):

        random_state = check_random_state(self.random_state)

        # Store the training set and train the nearest neighbours search on it
        self.X_check = X.copy()
        self.y_check = y.copy()
        self.neighbors.fit(self.X_check)

        n, p = X.shape
        self.estimators_ = []
        self.estimators_features_ = []

        for i in range(self.n_estimators):

            # Sample observations and features
            rows = random_state.randint(0, n, n)
            cols = random_state.randint(0, p, p)
            X_fit = (X[rows])[:, cols]
            y_fit = y[rows]

            # Clone the estimator
            estimator = clone(self.base_estimator)

            # Fit the estimator
            estimator.fit(X_fit, y_fit)

            # Store the estimator and the associated feature indexes
            self.estimators_.append(estimator)
            self.estimators_features_.append(cols)

        return self

    def predict(self, X):

        distances, neighbors_idxs = self.neighbors.kneighbors(X)
        y_pred = []

        for i, x in enumerate(X):

            # Determine the neighbourhood around x
            X_neigh = self.X_check[neighbors_idxs[i]]
            y_neigh = self.y_check[neighbors_idxs[i]]

            # Calculate the average score of each estimator on the neighbourhood
            weights = 1 / (0.1 + np.array([
                self.local_metric(
                    y_neigh,
                    estimator.predict(X_neigh[:, self.estimators_features_[j]])
                )
                for j, estimator in enumerate(self.estimators_)
            ]))

            # Retrieve the predictions of each model for x
            y_pred_all = np.array([
                estimator.predict(x[self.estimators_features_[j]].reshape(1, -1))
                for j, estimator in enumerate(self.estimators_)
            ]).reshape(1, -1)[0]

            # Calculate the weighted average of the predictions
            y_pred.append(np.average(y_pred_all, axis=0, weights=weights))

        return np.array(y_pred)

Compared with the original BaggingRegressor, LocallyWeightedBaggingRegressor – yes, the naming is getting a little long – takes two additional parameters:

  • local_metric, which plays the same role as the oob_metric used in WeightedBaggingRegressor and serves to assess the performance of a prediction,
  • and neighbors, which is an object that can perform nearest neighbours search on a dataset.

In the fit method, the only thing we do in addition to what was done in BaggingRegressor is to store the training set and to train the nearest neighbours search on it. This way, at prediction time we will be able to retrieve the nearest neighbours of any feature set and thus determine each $w_{i,m}$. The neighbors parameter expects to be given a neighbors.NearestNeighbors object from the sklearn library; it’s quite well documented on sklearn’s website.
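
For those unfamiliar with it, here is a minimal usage example of NearestNeighbors – with made-up data – showing that kneighbors returns both the distances and the indexes of the closest training rows.

import numpy as np
from sklearn import neighbors

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])

nn = neighbors.NearestNeighbors(n_neighbors=2)
nn.fit(X_train)

distances, idxs = nn.kneighbors(np.array([[1.4]]))
print(idxs)  # [[1 2]], the two closest training rows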

The weights are thus determined at prediction time. For each feature set $x_i$ in the dataset for which predictions are being made, we first retrieve the $k$ neighbours that surround it. In our setup $k$ is determined through the neighbors parameter – in this case I chose 50, but it could just as well have been 3 or 3000, although I’m not even sure there are that many rows in the dataset. Anyway, the weights are obtained by inverting the so-called local score of each model. Again, the inverse has to be used because the lower the score of a model, the more I want the model to contribute. I added 0.1 to the denominator to make sure it is never equal to 0; indeed this case might occur if a model is perfect on certain regions of a dataset. Then again, I could have chosen other values than 0.1 as a safeguard, but in the end this ultimately has to be a hyperparameter that I can’t be bothered to tune. What matters is the method and the heuristic I want to express. One key difference with LocallyWeightedBaggingRegressor is that each row of the dataset for which predictions are being made has to be treated separately, because the neighbours differ for each row. This makes the predict method rather slow – but who cares if the predictions are better!
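
To illustrate the safeguard, here is a toy computation – made-up scores again – where one model has a perfect local score of 0.

import numpy as np

local_scores = np.array([0.0, 1.0, 4.0])  # one model is locally perfect

weights = 1 / (0.1 + local_scores)

print(weights / weights.sum())  # ~[0.90, 0.08, 0.02]: dominant, but not all-powerful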

How does this compare with the rest of the models?

from sklearn import datasets
from sklearn import ensemble
from sklearn import model_selection
from sklearn import tree


X, y = datasets.load_boston(return_X_y=True)

n_estimators = 30

models = {
    'Decision tree': tree.DecisionTreeRegressor(random_state=RANDOM_STATE),
    'Decision tree bagging': BaggingRegressor(
        base_estimator=tree.DecisionTreeRegressor(random_state=RANDOM_STATE),
        n_estimators=n_estimators,
        random_state=RANDOM_STATE
    ),
    'Weighted decision tree bagging': WeightedBaggingRegressor(
        base_estimator=tree.DecisionTreeRegressor(random_state=RANDOM_STATE),
        n_estimators=n_estimators,
        random_state=RANDOM_STATE
    ),
    'Locally weighted decision tree bagging': LocallyWeightedBaggingRegressor(
        base_estimator=tree.DecisionTreeRegressor(random_state=RANDOM_STATE),
        n_estimators=n_estimators,
        random_state=RANDOM_STATE
    ),
    'Random forest': ensemble.RandomForestRegressor(
        n_estimators=n_estimators,
        random_state=RANDOM_STATE
    )
}

cv = model_selection.RepeatedKFold(n_repeats=10, n_splits=10, random_state=RANDOM_STATE)

for name, estimator in models.items():
    scores = model_selection.cross_val_score(estimator, X, y, cv=cv, scoring='neg_mean_squared_error')
    print('MSE: {:.3f} (± {:.3f}) [{}]'.format(-np.mean(scores), np.std(scores), name))

Which gives me the following results:

>>> MSE: 21.189 (± 11.583) [Decision tree]
>>> MSE: 12.495 (± 5.670) [Decision tree bagging]
>>> MSE: 12.341 (± 5.654) [Weighted decision tree bagging]
>>> MSE: 10.347 (± 4.204) [Locally weighted decision tree bagging]
>>> MSE: 11.121 (± 4.688) [Random forest]

Nice! The locally weighted version seems to have done better than sklearn’s random forest implementation. Of course we can’t really assess the performance of a model on a single dataset, but the results obtained thus far are encouraging me to go down this road a little more. However, that is out of the scope of this blog post.

I hope you enjoyed this post. I wouldn’t be surprised if what I’ve done has already been done by others. However I haven’t had much success in finding any related papers. If you know about any work in this direction feel free to shoot me an email – my address is at the bottom of this website.