class: center, middle

## Online machine learning with creme

### Max Halford

#### 11th of May 2019, Amsterdam

???

Hello!

---

### Outline

.bullets[
1. You're not doing ML the correct way
2. Cool kids do online learning
3. Introducing `creme`, the fresh kid on the block
4. Forecasting League of Legends match durations
5. Future work
]

???

My goal is to convince you that the way most people do machine learning isn't the way to go in production scenarios. Online learning is more suitable and nullifies many issues.

I'll introduce a new library that some friends and I have been working on, and I'll show you how we're using it for a pet project: forecasting the duration of League of Legends matches.

---

### Batch learning

.bullets[
1. Collect features $X$ and labels $Y$
2. Train a model on $(X, Y)$
3. Save the model somewhere
4. Load the model to make predictions
]

With code:

```python
>>> model.fit(X_train, y_train)
>>> dump(model, 'model.json')

>>> model = load('model.json')
>>> y_pred = model.predict(X_test)
```

???

The most common form of machine learning is called batch learning. It's what you do when you use scikit-learn and enter Kaggle competitions. I won't dwell on this too long as I'm pretty sure you're all acquainted with it. In short: fit, save, load, predict.

---

class: center, middle

### Batch machine learning in production

???

There isn't a clear pattern for putting a batch learning system into production. Typically this involves training a model periodically, then serializing it and saving it somewhere. API calls can then be issued to make predictions. This is such a difficult thing to get right that startups and tech giants are building products around it.

---

class: center, middle

### Batch machine learning in production is hard

???

It works fine for Kaggle and for cases where there is a clear split between train and test. But things are not so smooth in production.

---

background-color: #2ac380
class: center, middle, white

# Models have to be retrained from scratch with new data

---

background-color: #dbaaa8
class: center, middle, white

# Ever-increasing computational requirements

---

background-color: #1f282d
class: center, middle, white

# Models are static and rot faster than bananas 🍌

---

background-color: #e69138
class: center, middle, white

# Features that you develop locally might not be available in real time

???

This is more subtle. Sometimes it isn't possible to recreate features for past instances. If you don't store information about an instance at every point in time, then you can't reproduce its true state during your feature extraction process. In other words, you need a solid data engineering team that cares about time.

---

class: center, middle

> It is, as far as he knows, the only way of coming downstairs, but sometimes he feels that there really is another way, if only he could stop bumping for a moment and think of it.

???

And just like Winnie the Pooh, we're spending so much time banging our heads that we can't stop and think of a better way of doing things.
---

background-color: #607bd4
class: middle, white

## Online learning

.bigbullets[
- Data comes from a stream
- Models learn one observation at a time
- Observations do not have to be stored
- Features and labels are dynamic
]

---

background-color: #FF7F50
class: middle, white

## Different names, same thing

.bigbullets[
- Online learning
- Incremental learning
- Sequential learning
- Iterative learning
- Out-of-core learning
]

---

background-color: #008080
class: middle, white

## Applications

.bigbullets[
- Time series forecasting
- Spam filtering
- Recommender systems
- Ad placement
- Internet of things
- Basically, anything event-based
]

---

class: center, middle

### Online learning in a nutshell

---

background-color: #e66868ff
class: middle, white

## Why is batch learning so popular?

.bigbullets[
- Taught at university
- (Bad) habits
- Hype
- Kaggle
- Library availability
- Higher accuracy
]

---

class: middle

.bullets[
- Python library for doing online learning
- API heavily inspired by `sklearn` and easy to pick up
- Focus on feature extraction, not just learning
- First commit in January 2019
- Version `0.1.0` released earlier this week
]

---

### Observations

Representing an observation with a `dict` is natural:

```python
x = {
    'date': dt.datetime(2019, 4, 22),
    'price': 42.95,
    'shop': 'Ikea'
}
```

- Values can be of any type
- Feature names can be used instead of array indexes
- Python's standard library plays nicely with `dict`s

---

### Targets

A target's type depends on the context:

```python
# Regression
y = 42

# Binary classification
y = True

# Multi-class classification
y = 'setosa'

# Multi-output regression
y = {'height': 29.7, 'width': 21}
```

---

### Streaming data

```python
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

for x, y in X_y:
    print(x, y)
```

- `X_y` is a **generator** and consumes a tiny amount of memory
- The point is that we want to **handle data points one by one**
- The source depends on your use case (a Kafka topic, a CSV file, HTTP requests, etc.)

---

### Training with `fit_one`

```python
from creme import linear_model
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()

for x, y in X_y:
    model.fit_one(x, y)
```

Every `creme` estimator has a `fit_one` method

---

### Predicting with `predict_one`

```python
from creme import linear_model
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()

for x, y in X_y:
*    y_pred_before = model.predict_one(x)
    model.fit_one(x, y)
*    y_pred_after = model.predict_one(x)
```

- Classifiers also have a `predict_proba_one` method
- Transformers have a `transform_one` method
- Training and predicting phases are interleaved

---

### Monitoring performance

```python
from creme import linear_model
from creme import metrics
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()
*metric = metrics.Accuracy()

for x, y in X_y:
    y_pred = model.predict_one(x)
    model.fit_one(x, y)
*    metric.update(y, y_pred)

print(metric)
```

A validation score is available for free! No need to do any cross-validation.

You can also use `online_score` from the `model_selection` module, as sketched on the next slide.
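---

### Monitoring performance with `online_score`

The manual loop above can be collapsed into a single call. A minimal sketch, assuming `online_score` takes the data stream, the model and the metric, and returns the final metric:

```python
from creme import linear_model
from creme import metrics
from creme import model_selection
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()
metric = metrics.Accuracy()

# Interleaves predict_one and fit_one over the stream,
# updating the metric after each observation
score = model_selection.online_score(X_y, model, metric)

print(score)
```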
---

### Composing estimators is easy

```python
from creme import linear_model
from creme import metrics
from creme import preprocessing
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

scale = preprocessing.StandardScaler()
lin_reg = linear_model.LogisticRegression()
*model = scale | lin_reg  # Pipeline shorthand

metric = metrics.Accuracy()

for x, y in X_y:
    y_pred = model.predict_one(x)
    model.fit_one(x, y)
    metric.update(y, y_pred)

print(metric)
```

---

### `creme`'s current modules

.left-column[
- `cluster`
- `compat`
- `compose`
- `datasets`
- `dummy`
- `ensemble`
- `feature_extraction`
- `feature_selection`
- `impute`
- `linear_model`
]

.right-column[
- `model_selection`
- `multiclass`
- `naive_bayes`
- `optim`
- `preprocessing`
- `reco`
- `stats`
- `stream`
- `tree`
- `utils`
]

---

### Online mean

For every incoming $x$, do:

1. $n = n + 1$
2. $\mu\_{i+1} = \mu\_{i} + \frac{x - \mu\_{i}}{n}$

```python
>>> mean = creme.stats.Mean()

>>> mean.update(5)
>>> mean.get()
5

>>> mean.update(10)
>>> mean.get()
7.5
```

---

### Online variance

For every incoming $x$, do:

1. $n = n + 1$
2. $\mu\_{i+1} = \mu\_{i} + \frac{x - \mu\_{i}}{n}$
3. $s\_{i+1} = s\_i + (x - \mu\_{i}) \times (x - \mu\_{i+1})$ (running sum of squares)
4. $\sigma\_{i+1}^2 = \frac{s\_{i+1}}{n}$

```python
>>> variance = creme.stats.Variance()

>>> X = [2, 3, 4, 5]

>>> for x in X:
...     variance.update(x)

>>> variance.get()
1.25

>>> numpy.var(X)
1.25
```

???

This is called Welford's algorithm; it can be extended to skewness and kurtosis.

---

### Standard scaling

With the running mean and variance, we can scale incoming data so that it has zero mean and unit variance.

```python
>>> scaler = creme.preprocessing.StandardScaler()

>>> for x in [2, 3, 4, 5]:
...     print(x, 'becomes', scaler.fit_one({'x': x})['x'])
2 becomes 0.0
3 becomes 0.9999999999999996
4 becomes 1.224744871391589
5 becomes 1.3416407864998738
```

---

### Linear regression (1)

The model is $y_t = \langle w_t, x_t \rangle + b_t$.

The weights $w_t$ can be learnt with any online gradient descent algorithm, for example:

- Stochastic gradient descent (SGD)
- Adam
- RMSProp
- Follow the Regularized Leader (FTRL)

```python
from creme import linear_model
from creme import optim

lin_reg = linear_model.LinearRegression(
    optimizer=optim.Adam(lr=0.01)
)
```

---

### Linear regression (2)

The intercept term $b_t$ is difficult to learn.

Some people (Léon Bottou, scikit-learn) suggest using a lower learning rate for the intercept than for the weights (a heuristic, but it works okay).

`creme` can instead update the intercept with any running statistic from the `creme.stats` module, which is a powerful trick.

```python
from creme import linear_model
from creme import optim
from creme import stats

lin_reg = linear_model.LinearRegression(
    optimizer=optim.Adam(lr=0.01),
    intercept=stats.RollingMean(42)
)
```

---

### Online aggregations

```python
>>> import creme

>>> X = [
...     {'place': 'Taco Bell', 'revenue': 42},
...     {'place': 'Burger King', 'revenue': 16},
...     {'place': 'Burger King', 'revenue': 24},
...     {'place': 'Taco Bell', 'revenue': 58}
... ]

>>> agg = creme.feature_extraction.Agg(
...     on='revenue',
...     by='place',
...     how=creme.stats.Mean()
... )

>>> for x in X:
...     print(agg.fit_one(x).transform_one(x))
{'revenue_mean_by_place': 42.0}
{'revenue_mean_by_place': 16.0}
{'revenue_mean_by_place': 20.0}
{'revenue_mean_by_place': 50.0}
```

---

### Bagging (1)

Bagging is a popular and simple ensemble algorithm:

1. Pick $m$ base models (usually identical copies)
2. Train each model on a sample drawn with replacement
3. Average the predictions of each model on the test set

Each observation is sampled $K$ times, where $K$ follows a binomial distribution:

$$P(K=k) = {n \choose k} \times \left(\frac{1}{n}\right)^k \times \left(1 - \frac{1}{n}\right)^{n-k}$$

As $n$ grows towards infinity, $K$ can be approximated by a Poisson(1) distribution:

$$P(K=k) \approx \frac{e^{-1}}{k!}$$
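---

### Bagging: checking the Poisson approximation

A quick numerical sanity check of the approximation above, using only the standard library (an illustrative sketch, not part of `creme`):

```python
import math

n = 10_000  # a large number of observations

for k in range(4):
    # Exact Binomial(n, 1/n) probability of being sampled k times
    binom = math.comb(n, k) * (1 / n) ** k * (1 - 1 / n) ** (n - k)
    # Poisson(1) approximation
    poisson = math.exp(-1) / math.factorial(k)
    print(k, round(binom, 5), round(poisson, 5))
```

Even for moderate $n$, the two distributions are already very close.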
---

### Bagging (2)

`ensemble.BaggingClassifier` is very simple:

```python
def fit_one(self, x, y):

    for estimator in self.estimators:
        for _ in range(self.rng.poisson(1)):
            estimator.fit_one(x, y)

    return self

def predict_proba_one(self, x):
    y_pred = statistics.mean(
        estimator.predict_proba_one(x)[True]
        for estimator in self.estimators
    )
    return {True: y_pred, False: 1 - y_pred}
```

---

### League of Legends match duration forecasting (1)

---

### League of Legends match duration forecasting (2)

---

### Architecture

---

class: middle

### Django model

```python
from django.db import models
from picklefield.fields import PickledObjectField


class CremeModel(models.Model):
    name = models.TextField(unique=True)
    pipeline = PickledObjectField()

    class Meta:
        db_table = 't_models'
        verbose_name_plural = 'models'

    def fit_one(self, x, y):
        self.pipeline.fit_one(x, y)
        return self

    def predict_one(self, x):
        return self.pipeline.predict_one(x)
```

---

class: middle

### Code for predicting

```python
import datetime as dt

match = fetch_match(match_id)

model = models.CremeModel.objects.get(name='My awesome model')

predicted_duration = model.predict_one(match.raw_info)  # in seconds

match.predicted_ended_at = match.started_at + dt.timedelta(seconds=predicted_duration)
match.predicted_by = model
match.save()
```

---

class: middle

### Code for training

When the match ends, it has a `true_duration` property.

```python
model = models.CremeModel.objects.get(id=match.predicted_by.id)

model.fit_one(
    x=match.raw_info,
    y=match.true_duration.seconds
)

model.save()
```

---

class: middle

### Code for calculating performance

Just some Django black magic.

```python
duration = ExpressionWrapper(
    Func(F('predicted_ended_at') - F('ended_at'), function='ABS'),
    output_field=fields.DurationField()
)

agg = models.Match.objects.exclude(ended_at__isnull=True)\
                          .annotate(duration=duration)\
                          .aggregate(Avg('duration'))

avg_error = agg['duration__avg']
```

---

### Benefits of online learning

.bullets[
- No need to schedule model training
- Easy to debug and monitor
- You're not "far" from production
- Way more fun than batch learning
]

---

### Future work

.left-column[
.bullets[
- Decision trees (nearly there!)
- Gradient boosting
- Bayesian linear models
- More feature extraction
- More models
- More benchmarks
- Many issues [on GitHub](https://github.com/creme-ml/creme/issues)
]
]

.right-column[
]

---

### If you want to contribute

- [creme-ml.github.io](https://creme-ml.github.io/)
- [github.com/creme-ml](https://github.com/creme-ml/)
- You can shoot emails to [maxhalford25@gmail.com](mailto:maxhalford25@gmail.com)
- Get in touch if you want to try `creme` and would like advice
- Spread the word!

---

class: center, middle

# Thanks for listening!

.left-column[
]

.right-column[
]