What is this?
This is the documentation for StarBoost, a Python library that implements gradient boosting. Gradient boosting is an efficient and popular machine learning algorithm used for supervised learning.
Doesn’t scikit-learn already do that?
Indeed, scikit-learn implements gradient boosting, but the only supported weak learner is the decision tree. In essence, gradient boosting can be used with weak learners other than decision trees.
What about XGBoost/LightGBM/CatBoost?
The mentioned libraries are the state of the art for gradient boosted regression trees (GBRT). They implement a specific version of gradient boosting that is tailored to decision trees. StarBoost's purpose isn't to compete with them. Instead, its goal is to implement a generic gradient boosting algorithm that works with any weak learner.
A focus of StarBoost is to keep the code readable and commented, instead of obfuscating the algorithm under a pile of tangled code.
What’s a weak learner?
A weak learner is any machine learning model that can learn from labeled data. It’s called “weak” because it usually works better as part of an ensemble (such as gradient boosting). Examples are linear models, radial basis functions, decision trees, genetic programming, neural networks, etc. In theory you could even use gradient boosting as a weak learner.
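For instance, here is a minimal sketch of boosting a non-tree weak learner, using a ridge regression from scikit-learn as the base estimator; the dataset and hyperparameters are purely illustrative.

from sklearn import datasets
from sklearn import linear_model
import starboost as sb

# Any scikit-learn regressor can play the role of the weak learner.
X, y = datasets.load_diabetes(return_X_y=True)

model = sb.BoostingRegressor(
    base_estimator=linear_model.Ridge(alpha=1.0),
    n_estimators=30,
    learning_rate=0.1
)
model.fit(X, y)
y_pred = model.predict(X)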
Is it compatible with scikit-learn?
Yes, it is.
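For example, because BoostingRegressor follows the scikit-learn estimator API, it can be handed to scikit-learn utilities such as cross_val_score. This is only a sketch; the dataset and scoring choice are illustrative.

from sklearn import datasets
from sklearn import model_selection
from sklearn import tree
import starboost as sb

X, y = datasets.load_diabetes(return_X_y=True)

model = sb.BoostingRegressor(
    base_estimator=tree.DecisionTreeRegressor(max_depth=3),
    n_estimators=30,
    learning_rate=0.1
)

# StarBoost models can be cross-validated like any other scikit-learn estimator.
scores = model_selection.cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
print(scores.mean())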
How do I install it?
Barring any weird Python setup, you simply have to run pip install starboost.
How do I use it?
The following snippet shows a very basic usage of StarBoost. Please check out the examples directory on GitHub for comprehensive examples.
from sklearn import datasets
from sklearn import tree
import starboost as sb
X, y = datasets.load_diabetes(return_X_y=True)  # load_boston has been removed from recent scikit-learn releases
model = sb.BoostingRegressor(
    base_estimator=tree.DecisionTreeRegressor(max_depth=3),
    n_estimators=30,
    learning_rate=0.1
)
model = model.fit(X, y)
y_pred = model.predict(X)
What are you planning on doing next?
- Logging the progress
- Handling sample weights
- Implementing more loss functions
- Making it faster
- Newton boosting (taking into account the information from the Hessian)
- Learning to rank
By the way, why is it called “StarBoost”?
As you might already know, in programming the star symbol * often refers to the concept of "everything". The idea is that StarBoost can be used with any weak learner, not just decision trees.
Boosting¶
Regression¶
class starboost.BoostingRegressor(loss=None, base_estimator=None, base_estimator_is_tree=False, n_estimators=30, init_estimator=None, line_searcher=None, learning_rate=0.1, row_sampling=1.0, col_sampling=1.0, eval_metric=None, early_stopping_rounds=None, random_state=None)[source]¶

Gradient boosting for regression.
Parameters:

- loss (starboost.losses.Loss, default=starboost.losses.L2Loss) – The loss function that will be optimized. At every stage a weak learner will be fit to the negative gradient of the loss. The provided value must be a class that at the very least implements a __call__ method and a gradient method (a minimal sketch of such a class is given after this parameter list).
- base_estimator (sklearn.base.RegressorMixin, default=None) – The weak learner. This must be a regression model. If None then a decision stump will be used.
- base_estimator_is_tree (bool, default=False) – Indicates whether or not the provided base_estimator is a tree model. Various boosting optimizations specific to trees can be made to improve the overall performance.
- n_estimators (int, default=30) – The maximum number of weak learners to train. The final number of trained weak learners will be lower than n_estimators if early stopping happens.
- init_estimator (sklearn.base.BaseEstimator, default=None) – The estimator used to make the initial guess. If None then the init_estimator property from the loss will be used.
- line_searcher (starboost.line_searchers.LineSearcher, default=None) – A line searcher which can be used to find the optimal step size during gradient descent. If you've set base_estimator_is_tree to True and are using one of StarBoost's losses then an optimal line searcher will be used, meaning you can safely set this field to None.
- learning_rate (float, default=0.1) – The learning rate shrinks the contribution of each tree. Specifically, the descent direction estimated by each weak learner will be multiplied by learning_rate. There is a trade-off between learning_rate and n_estimators.
- row_sampling (float, default=1.0) – The ratio of rows to sample at each stage.
- col_sampling (float, default=1.0) – The ratio of columns to sample at each stage.
- eval_metric (function, default=None) – The evaluation metric used to check for early stopping. If None it will default to the loss.
- random_state (int, RandomState instance or None, default=None) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
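As a rough sketch of the loss interface mentioned above, a custom loss only needs a __call__ method and a gradient method. The Huber-style loss below is hypothetical and only meant to illustrate the shape of such a class; any further attributes StarBoost may expect are not covered here.

import numpy as np

class HuberLoss:
    """Hypothetical custom loss; assumes only the documented interface:
    __call__(y_true, y_pred) and gradient(y_true, y_pred)."""

    def __init__(self, delta=1.0):
        self.delta = delta

    def __call__(self, y_true, y_pred):
        diff = np.asarray(y_pred) - np.asarray(y_true)
        abs_diff = np.abs(diff)
        quadratic = 0.5 * diff ** 2
        linear = self.delta * (abs_diff - 0.5 * self.delta)
        return np.mean(np.where(abs_diff <= self.delta, quadratic, linear))

    def gradient(self, y_true, y_pred):
        diff = np.asarray(y_pred) - np.asarray(y_true)
        return np.where(np.abs(diff) <= self.delta, diff, self.delta * np.sign(diff))

If you pass such a loss you will probably also want to provide an explicit init_estimator, since by default the initial guess comes from the loss's init_estimator property, which this sketch does not define.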
fit(X, y, eval_set=None)[source]¶

Fit a gradient boosting procedure to a dataset.

Parameters:

- X (array-like or sparse matrix of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the weak model.
- y (array-like of shape (n_samples,)) – The training target values (strings or integers in classification, real numbers in regression).
- eval_set (tuple of length 2, optional, default=None) – The evaluation set is a tuple (X_val, y_val). It has to respect the same conventions as X and y.

Returns: self
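For illustration, here is how an evaluation set might be supplied together with early stopping. The train/validation split is arbitrary, and early_stopping_rounds (which appears in the constructor signature above but is not detailed here) is given a purely illustrative value.

from sklearn import datasets, model_selection, tree
import starboost as sb

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_val, y_train, y_val = model_selection.train_test_split(X, y, random_state=42)

model = sb.BoostingRegressor(
    base_estimator=tree.DecisionTreeRegressor(max_depth=3),
    n_estimators=100,
    learning_rate=0.1,
    early_stopping_rounds=5  # illustrative value
)
model.fit(X_train, y_train, eval_set=(X_val, y_val))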
iter_predict(X, include_init=False)[source]¶

Returns the predictions for X at every stage of the boosting procedure.

Parameters:

- X (array-like or sparse matrix of shape (n_samples, n_features)) – The input samples. Sparse matrices are accepted only if they are supported by the weak model.
- include_init (bool, default=False) – If True then the prediction from init_estimator will also be returned.

Returns: iterator of arrays of shape (n_samples,) containing the predicted values at each stage
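A small sketch of how iter_predict can be used to monitor the training error stage by stage; the dataset and metric are illustrative.

from sklearn import datasets, metrics, tree
import starboost as sb

X, y = datasets.load_diabetes(return_X_y=True)
model = sb.BoostingRegressor(
    base_estimator=tree.DecisionTreeRegressor(max_depth=3),
    n_estimators=30
).fit(X, y)

# One prediction array is yielded per boosting stage.
for stage, y_pred in enumerate(model.iter_predict(X)):
    print(stage, metrics.mean_squared_error(y, y_pred))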
predict(X)¶

Returns the predictions for X.

Under the hood this method simply goes through the outputs of iter_predict and returns the final one.

Parameters: X (array-like or sparse matrix of shape (n_samples, n_features)) – The input samples. Sparse matrices are accepted only if they are supported by the weak model.

Returns: array of shape (n_samples,) containing the predicted values.
Classification¶
class starboost.BoostingClassifier(loss=None, base_estimator=None, base_estimator_is_tree=False, n_estimators=30, init_estimator=None, line_searcher=None, learning_rate=0.1, row_sampling=1.0, col_sampling=1.0, eval_metric=None, early_stopping_rounds=None, random_state=None)[source]¶

Gradient boosting for classification.
Parameters:

- loss (starboost.losses.Loss, default=starboost.losses.L2Loss) – The loss function that will be optimized. At every stage a weak learner will be fit to the negative gradient of the loss. The provided value must be a class that at the very least implements a __call__ method and a gradient method.
- base_estimator (sklearn.base.RegressorMixin, default=None) – The weak learner. This must be a regression model, even though the task is classification. If None then a decision stump will be used.
- base_estimator_is_tree (bool, default=False) – Indicates whether or not the provided base_estimator is a tree model. Various boosting optimizations specific to trees can be made to improve the overall performance.
- n_estimators (int, default=30) – The maximum number of weak learners to train. The final number of trained weak learners will be lower than n_estimators if early stopping happens.
- init_estimator (sklearn.base.BaseEstimator, default=None) – The estimator used to make the initial guess. If None then the init_estimator property from the loss will be used.
- line_searcher (starboost.line_searchers.LineSearcher, default=None) – A line searcher which can be used to find the optimal step size during gradient descent. If you've set base_estimator_is_tree to True and are using one of StarBoost's losses then an optimal line searcher will be used, meaning you can safely set this field to None.
- learning_rate (float, default=0.1) – The learning rate shrinks the contribution of each tree. Specifically, the descent direction estimated by each weak learner will be multiplied by learning_rate. There is a trade-off between learning_rate and n_estimators.
- row_sampling (float, default=1.0) – The ratio of rows to sample at each stage.
- col_sampling (float, default=1.0) – The ratio of columns to sample at each stage.
- eval_metric (function, default=None) – The evaluation metric used to check for early stopping. If None it will default to the loss.
- random_state (int, RandomState instance or None, default=None) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
fit(X, y, eval_set=None)[source]¶

Fit a gradient boosting procedure to a dataset.

Parameters:

- X (array-like or sparse matrix of shape (n_samples, n_features)) – The training input samples. Sparse matrices are accepted only if they are supported by the weak model.
- y (array-like of shape (n_samples,)) – The training target values (strings or integers in classification, real numbers in regression).
- eval_set (tuple of length 2, optional, default=None) – The evaluation set is a tuple (X_val, y_val). It has to respect the same conventions as X and y.

Returns: self
iter_predict(X, include_init=False)[source]¶

Returns the predicted classes for X at every stage of the boosting procedure.

Parameters:

- X (array-like or sparse matrix of shape (n_samples, n_features)) – The input samples. Sparse matrices are accepted only if they are supported by the weak model.
- include_init (bool, default=False) – If True then the prediction from init_estimator will also be returned.

Returns: iterator of arrays of shape (n_samples, n_classes) containing the predicted classes at each stage.
iter_predict_proba(X, include_init=False)[source]¶

Returns the predicted probabilities for X at every stage of the boosting procedure.

Parameters:

- X (array-like or sparse matrix of shape (n_samples, n_features)) – The input samples. Sparse matrices are accepted only if they are supported by the weak model.
- include_init (bool, default=False) – If True then the prediction from init_estimator will also be returned.

Returns: iterator of arrays of shape (n_samples, n_classes) containing the predicted probabilities at each stage
predict(X)¶

Returns the predictions for X.

Under the hood this method simply goes through the outputs of iter_predict and returns the final one.

Parameters: X (array-like or sparse matrix of shape (n_samples, n_features)) – The input samples. Sparse matrices are accepted only if they are supported by the weak model.

Returns: array of shape (n_samples,) containing the predicted values.
predict_proba(X)[source]¶

Returns the predicted probabilities for X.

Parameters: X (array-like or sparse matrix of shape (n_samples, n_features)) – The input samples. Sparse matrices are accepted only if they are supported by the weak model.

Returns: array of shape (n_samples, n_classes) containing the predicted probabilities.
Losses¶
L1 loss¶
class starboost.losses.L1Loss[source]¶

Computes the L1 loss, also known as the mean absolute error.

Mathematically, the L1 loss is defined as

\(L = \frac{1}{n} \sum_i^n |p_i - y_i|\)

Its gradient with respect to each prediction is

\(\frac{\partial L}{\partial p_i} = sign(p_i - y_i)\)

where \(sign(p_i - y_i)\) is equal to 0 if \(p_i\) is equal to \(y_i\). Note that this is slightly different from scikit-learn, which replaces 0s with -1s.
Using L1Loss produces mostly the same results as setting the loss parameter to 'lad' in scikit-learn's GradientBoostingRegressor.

__call__(y_true, y_pred)[source]¶

Returns the L1 loss.

Example

>>> import starboost as sb
>>> y_true = [0, 0, 1]
>>> y_pred = [0.5, 0.5, 0.5]
>>> sb.losses.L1Loss()(y_true, y_pred)
0.5
default_init_estimator¶

Returns starboost.init.QuantileEstimator(alpha=0.5).
gradient(y_true, y_pred)[source]¶

Returns the gradient of the L1 loss with respect to each prediction.

Example

>>> import starboost as sb
>>> y_true = [0, 0, 1]
>>> y_pred = [0.3, 0, 0.8]
>>> sb.losses.L1Loss().gradient(y_true, y_pred)
array([ 1.,  0., -1.])
tree_line_searcher¶

When using L1Loss the gradient descent procedure will chase the negative of L1Loss's gradient. The negative of the gradient is solely composed of 1s, -1s, and 0s. It turns out that replacing the estimated descent direction with the median of the corresponding residuals will in fact minimize the overall mean absolute error much faster.

This is exactly the same procedure scikit-learn uses to modify the leaves of decision trees in GradientBoostingRegressor. However this procedure is more generic and works with any kind of weak learner.
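A quick numerical check (not StarBoost code) of why the median is a good replacement: among constant steps, the median of the residuals minimizes the mean absolute error, whereas, for instance, the mean does not.

import numpy as np

residuals = np.array([-3.0, -1.0, 0.5, 2.0, 10.0])
for name, step in [('median', np.median(residuals)), ('mean', np.mean(residuals))]:
    # Mean absolute error left over after taking a constant step of this size.
    print(name, np.mean(np.abs(residuals - step)))
# The median (3.2) beats the mean (3.44) on this data, and does so in general.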
L2 loss¶
class starboost.losses.L2Loss[source]¶

Computes the L2 loss, also known as the mean squared error.

Mathematically, the L2 loss is defined as

\(L = \frac{1}{n} \sum_i^n (p_i - y_i)^2\)

Its gradient with respect to each prediction is

\(\frac{\partial L}{\partial p_i} = p_i - y_i\)

Using MSE is equivalent to setting the loss parameter to 'ls' in scikit-learn's GradientBoostingRegressor.
__call__(y_true, y_pred)[source]¶

Returns the L2 loss.

Example

>>> import starboost as sb
>>> y_true = [10, 25, 0]
>>> y_pred = [5, 30, 5]
>>> sb.losses.L2Loss()(y_true, y_pred)
25.0
default_init_estimator¶

Returns starboost.init.MeanEstimator().
Log loss¶
class starboost.losses.LogLoss[source]¶

Computes the logarithmic loss.

Mathematically, the log loss is defined as

\(L = -\frac{1}{n} \sum_i^n \left[ y_i \log(p_i) + (1-y_i) \log(1-p_i) \right]\)

Its gradient with respect to each prediction is

\(\frac{\partial L}{\partial p_i} = p_i - y_i\)

This loss works for binary classification as well as for multi-class cases (in which case the loss is usually referred to as "cross-entropy").
__call__(y_true, y_pred)[source]¶

Returns the log loss.

Example

>>> import starboost as sb
>>> y_true = [0, 0, 1]
>>> y_pred = [0.5, 0.5, 0.5]
>>> sb.losses.LogLoss()(y_true, y_pred)
0.807410...
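The figure above is larger than the 0.693 one would get by treating the predictions directly as probabilities, which suggests the raw predictions are first passed through the sigmoid function. That is an inference from the example rather than a statement about the implementation; the following numpy snippet reproduces the value under that assumption.

import numpy as np

# Assumption: predictions are raw scores mapped through the sigmoid.
y_true = np.array([0, 0, 1])
y_pred = np.array([0.5, 0.5, 0.5])
p = 1 / (1 + np.exp(-y_pred))  # sigmoid, roughly 0.6225

loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(loss)  # roughly 0.807410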
default_init_estimator¶

Returns starboost.init.PriorProbabilityEstimator().
Line searchers¶
During gradient descent the negative gradient of the loss function indicates the direction of descent. A line searcher can be used to determine how far to move along that direction, in other words the step size.