class: center, middle

## Incremental machine learning: from concepts to practice

### Max Halford

#### May 28, 2019

#### Toulouse Data Science Meetup

???

Hello!

---

### Outline

.bullets[
1. You're doing machine learning the wrong way 😱
2. Cool kids do online learning 😎
3. Introducing `creme`, a Python lib for online learning 🐍
4. Bike stations forecasting demo 🚲 🔮
]

---

class: center, middle

### PyData Amsterdam 2019 🎉 🇳🇱 🧀

---

### Batch learning

.bullets[
1. Collect features $X$ and labels $Y$
2. Train a model on $(X, Y)$
3. Save the model somewhere
4. Load the model to make predictions
]

With code:

```python
>>> model.fit(X_train, y_train)
>>> dump(model, 'model.json')
>>> model = load('model.json')
>>> y_pred = model.predict(X_test)
```

---

class: center, middle

### Batch machine learning in production

---

background-color: #2ac380
class: center, middle, white

# Models have to be retrained from scratch with new data ⏰⏰⏰

---

background-color: #663399
class: center, middle, white

# Models need increasing amounts of power 🔌

---

background-color: #1f282d
class: center, middle, white

# Models are static and "rot" faster than bananas 🍌

---

background-color: #e69138
class: center, middle, white

# Models that work locally don't always work in production 😭

---

class: center, middle

> It is, as far as he knows, the only way of coming downstairs, but sometimes he feels that there really is another way, if only he could stop bumping for a moment and think of it.

???

And just like Winnie the Pooh, we're spending too much time banging our heads to be able to think of a better way of doing things.

---

background-color: #607bd4
class: middle, white

## Online learning

.bigbullets[
- Data comes from a stream
- Models learn one observation at a time
- Features and labels are dynamic
]

---

## Everything changes 🔥

---

background-color: #008080
class: middle, white

## Different names, same thing 🤷

.bigbullets[
- Online learning
- Incremental learning
- Sequential learning
- Iterative learning
- Continuous learning
- Out-of-core learning
]

---

background-color: #FF7F50
class: middle, white

## Applications

.bigbullets[
- Time series forecasting
- Spam filtering
- Recommender systems
- Ad placement
- Internet of things
- Basically, anything event-based
]

---

class: center, middle

### Online learning in a nutshell
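---

class: middle

### The online learning loop, sketched

A toy sketch of the predict-then-learn pattern in plain Python. `MajorityClassifier` and its methods are made up for illustration; no particular library's API is implied:

```python
from collections import Counter
import random

class MajorityClassifier:
    """Toy online model: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def fit_one(self, x, y):
        self.counts[y] += 1

model = MajorityClassifier()

# Simulated stream of emails: the model sees one observation at a time
for _ in range(1000):
    x = {'length': random.randint(10, 500)}
    y = random.random() < 0.3          # e.g. True = spam
    y_pred = model.predict_one(x)      # predict before the label is revealed
    model.fit_one(x, y)                # then learn from the observation
```

The model never sees the dataset as a whole, so memory usage stays constant no matter how long the stream runs.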
---

background-color: #e66868ff
class: middle, white

## Why is batch learning so popular?

.bigbullets[
- Taught at university 🎓
- (Bad) habits
- Hype
- Kaggle 💯
- Library availability
]

---

class: center, middle

# Questions?

---

class: middle

.bullets[
- Online machine learning library for Python 🐍
- Easy-to-pick-up API inspired by `sklearn`
- Written with production scenarios in mind
- First commit in January 2019
- Version `0.2.0` released yesterday
]

---

#### scikit-learn

```python
from sklearn import datasets
from sklearn import linear_model

X, y = datasets.load_boston(return_X_y=True)
model = linear_model.LinearRegression()
model.fit(X, y)
```

#### creme

```python
from creme import linear_model
from creme import stream
from sklearn import datasets

X_y = stream.iter_sklearn_dataset(datasets.load_boston)
model = linear_model.LinearRegression()

for x, y in X_y:
    model.fit_one(x, y)
```

---

class: middle

### Features

Representing a set of features with a `dict` is natural:

```python
x = {
    'date': dt.datetime(2019, 4, 22),
    'price': 42.95,
    'shop': 'Ikea'
}
```

- Values can be of any type
- Feature names can be used instead of array indexes
- Python's standard library plays nicely with `dict`s

---

class: middle

### Targets

A target's type depends on the context:

```python
# Regression
y = 42

# Binary classification
y = True

# Multi-class classification
y = 'setosa'

# Multi-output regression
y = {'height': 29.7, 'width': 21}
```

---

class: middle

### Streaming data

```python
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

for x, y in X_y:
    print(x, y)
```

- `X_y` is a **generator** and consumes a tiny amount of memory
- The point is that we only need one data point at a time
- The source depends on your use case (CSV file, Kafka consumer, HTTP requests)

---

class: middle

### Training with `fit_one`

```python
from creme import linear_model
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()

for x, y in X_y:
*   model.fit_one(x, y)
```

Every `creme` estimator has a `fit_one` method

---

class: middle

### Predicting with `predict_one`

```python
from creme import linear_model
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()

for x, y in X_y:
*   y_pred = model.predict_one(x)
    model.fit_one(x, y)
```

- Classifiers also have a `predict_proba_one` method
- Transformers have a `transform_one` method
- Training and predicting phases are interleaved

---

class: middle

### Progressive validation 🎯

```python
from creme import linear_model
from creme import metrics
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()
metric = metrics.Accuracy()

for x, y in X_y:
    y_pred = model.predict_one(x)
    model.fit_one(x, y)
*   metric.update(y, y_pred)
    print(metric)
```

The validation score is available for free! No need for cross-validation.

You can also use `online_score` from the `model_selection` module.

---

class: middle

### Composing estimators is easy

```python
from creme import compose
from creme import linear_model
from creme import preprocessing

scale = preprocessing.StandardScaler()
lin_reg = linear_model.LogisticRegression()

# You can do this...
model = compose.Pipeline([
    ('scale', scale),
    ('lin_reg', lin_reg)
])

# Or this...
model = scale | lin_reg
```

---

class: center, middle

# Questions?

---

### Online mean

For every incoming $x$, do:

1. $n = n + 1$
2. $\mu\_{i+1} = \mu\_{i} + \frac{x - \mu\_{i}}{n}$

```python
>>> mean = creme.stats.Mean()

>>> mean.update(5)
>>> mean.get()
5

>>> mean.update(10)
>>> mean.get()
7.5
```
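---

### Online mean, from scratch

The update rule above fits in a few lines of plain Python. Here is a from-scratch sketch, for illustration only (this is not `creme`'s actual implementation):

```python
class Mean:
    """Running mean: mu <- mu + (x - mu) / n."""

    def __init__(self):
        self.n = 0
        self.mu = 0.0

    def update(self, x):
        self.n += 1
        self.mu += (x - self.mu) / self.n

    def get(self):
        return self.mu

mean = Mean()
for x in (5, 10):
    mean.update(x)
print(mean.get())  # 7.5
```

This is exactly step 2 above; the variance slide that follows extends the same idea via Welford's algorithm.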
---

### Online variance

For every incoming $x$, do:

1. $n = n + 1$
2. $\mu\_{i+1} = \mu\_{i} + \frac{x - \mu\_{i}}{n}$
3. $s\_{i+1} = s\_i + (x - \mu\_{i}) \times (x - \mu\_{i+1})$
4. $\sigma\_{i+1}^2 = \frac{s\_{i+1}}{n}$

```python
>>> variance = creme.stats.Variance()
>>> X = [2, 3, 4, 5]

>>> for x in X:
...     variance.update(x)

>>> variance.get()
1.25

>>> numpy.var(X)
1.25
```

???

This is called Welford's algorithm; it can be extended to skew and kurtosis

---

### Standard scaling

Using the mean and the variance, we can rescale incoming data.

```python
>>> scaler = creme.preprocessing.StandardScaler()

>>> for x in [2, 3, 4, 5]:
...     features = {'x': x}
...     scaler.fit_one(features)
...     new_x = scaler.transform_one(features)['x']
...     print(f'{x} becomes {new_x}')
2 becomes 0.0
3 becomes 0.9999999999999996
4 becomes 1.224744871391589
5 becomes 1.3416407864998738
```

---

### Linear regression (1)

The model is $y_t = \langle w_t, x_t \rangle + b_t$.

The weights $w_t$ can be learnt with any online gradient descent algorithm, for example:

- Stochastic gradient descent (SGD)
- Adam
- RMSProp
- Follow the Regularized Leader (FTRL)

```python
from creme import linear_model
from creme import optim

lin_reg = linear_model.LinearRegression(
    optimizer=optim.Adam(lr=0.01)
)
```

---

### Linear regression (2)

Some people (Léon Bottou, scikit-learn) suggest using a lower learning rate for the intercept than for the weights (a heuristic, but it works)

`creme` can use any running statistic from the `creme.stats` module as the intercept, which is a powerful trick

```python
from creme import linear_model
from creme import optim
from creme import stats

lin_reg = linear_model.LinearRegression(
    optimizer=optim.Adam(lr=0.01),
    intercept=stats.RollingMean(42)
)
```

---

### Online aggregations

```python
>>> import creme

>>> X = [
...     {'meal': '🍕', 'sales': 42},
...     {'meal': '🍔', 'sales': 16},
...     {'meal': '🍔', 'sales': 24},
...     {'meal': '🍕', 'sales': 58}
... ]

>>> agg = creme.feature_extraction.Agg(
...     on='sales',
...     by='meal',
...     how=creme.stats.Mean()
... )

>>> for x in X:
...     print(agg.fit_one(x).transform_one(x))
{'sales_mean_by_meal': 42.0}
{'sales_mean_by_meal': 16.0}
{'sales_mean_by_meal': 20.0}
{'sales_mean_by_meal': 50.0}
```

---

### Bagging (1)

In bootstrap sampling, each observation is sampled $K$ times, where $K$ follows a binomial distribution:

$$P(K=k) = {n \choose k} \times (\frac{1}{n})^k \times (1 - \frac{1}{n})^{n-k}$$

As $n$ grows towards infinity, $K$ can be approximated by a Poisson(1) distribution:

$$P(K=k) \approx \frac{e^{-1}}{k!}$$

This leads to a simple and efficient online algorithm.

---

### Bagging (2)

`ensemble.BaggingClassifier` is very simple:

```python
def fit_one(self, x, y):
    for estimator in self.estimators:
        for _ in range(self.rng.poisson(1)):
            estimator.fit_one(x, y)
    return self

def predict_proba_one(self, x):
    y_pred = statistics.mean(
        estimator.predict_proba_one(x)[True]
        for estimator in self.estimators
    )
    return {True: y_pred, False: 1 - y_pred}
```
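---

### Bagging (3): checking the approximation

As a quick numerical sanity check of the Poisson(1) approximation, we can compare empirical frequencies. This is a sketch added for illustration; it assumes `numpy` is available:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # size of the bootstrap sample

# How often does a given observation get picked k times?
binomial = rng.binomial(n=n, p=1 / n, size=100_000)
poisson = rng.poisson(lam=1, size=100_000)

for k in range(4):
    print(
        f'P(K={k}): binomial ~ {np.mean(binomial == k):.4f}, '
        f'Poisson(1) ~ {np.mean(poisson == k):.4f}'
    )
```

The two columns agree to a few decimal places; for $k = 0$ both are close to $e^{-1} \approx 0.368$.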
---

### Decision trees 🌳

- A version of Hoeffding trees is being implemented
- Basic idea:
  - Start with a single leaf 🍃
  - Find the leaf where an observation belongs 🔍
  - Update the leaf's sufficient statistics 📊
  - Measure the information gain every so often 🔬
  - Split when the information gain is good enough 🎉
- Mondrian trees 👨‍🎨 are another possibility, but they only work with continuous attributes

---

class: center, middle

# Questions?

---

### `creme`'s current modules

`cluster`, `compat`, `compose`, `datasets`, `dummy`, `ensemble`, `feature_extraction`, `feature_selection`, `impute`, `linear_model`, `model_selection`, `multiclass`, `naive_bayes`, `optim`, `plot`, `preprocessing`, `proba`, `reco`, `stats`, `stream`, `tree`, `utils`

---

### Cool stuff in `creme` we skipped 😢

.bullets[
- Clustering
- Factorization machines
- Feature selection
- Passive-aggressive models
- Recommender systems
- Histograms
- Skyline queries
- Fourier transforms
- Imputation
- Naive Bayes
]

---

### Alternative frameworks

---

### Benefits of online learning

.bullets[
- No need to schedule model training
- Easy to monitor
- You're very close to production
- Way more fun than batch learning
]

---

### Current work

.bullets[
- Decision trees (nearly there)
- Gradient boosting (easy)
- Bayesian linear models (part of my PhD)
- Latent Dirichlet Allocation (ask Raphael)
- Many issues [on GitHub](https://github.com/creme-ml/creme/issues)
]

---

### What next?

- [creme-ml.github.io](https://creme-ml.github.io/)
- [github.com/creme-ml](https://github.com/creme-ml/)
- You can send emails to [maxhalford25@gmail.com](mailto:maxhalford25@gmail.com)
- Get in touch if you want help and/or advice
- Starring us on GitHub helps a lot 🌟

---

class: center, middle

# Thanks for listening!