class: center, middle

## Online machine learning with creme

### Max Halford

#### 11th of May 2019, Amsterdam

???

Hello!

---

### Outline

.bullets[
1. You're not doing ML the correct way
2. Cool kids do online learning
3. Introducing `creme`, the fresh kid on the block
4. Forecasting League of Legends match durations
5. Future work
]

???

My goal is to convince you that the way most people do machine learning isn't the way to go in production scenarios. Online learning is more suitable and nullifies many issues.

I'll introduce a new library that some friends and I have been working on, and I'll show you how we're using it for a pet project: forecasting the duration of League of Legends matches.

---

### Batch learning

.bullets[
1. Collect features $X$ and labels $Y$
2. Train a model on $(X, Y)$
3. Save the model somewhere
4. Load the model to make predictions
]

With code:

```python
>>> model.fit(X_train, y_train)
>>> dump(model, 'model.json')

>>> model = load('model.json')
>>> y_pred = model.predict(X_test)
```

???

The most common form of machine learning is called batch learning. It's what you do when you use scikit-learn and enter Kaggle competitions. I won't dwell on this too long as I'm pretty sure you're all acquainted with it. In short: fit, save, load, predict.

---

class: center, middle

### Batch machine learning in production

???

There isn't a clear pattern for putting a batch learning system into production. Typically this involves training a model periodically, then serializing it and saving it somewhere. API calls can then be issued to make predictions. This is such a difficult thing to get right that startups and tech giants are building products around it.

---

class: center, middle

### Batch machine learning in production is hard

???

It works fine for Kaggle and for cases where there is a clear split between train and test. But things are not so smooth in production.

---

background-color: #2ac380
class: center, middle, white

# Models have to be retrained from scratch with new data

---

background-color: #dbaaa8
class: center, middle, white

# Ever-increasing computational requirements

---

background-color: #1f282d
class: center, middle, white

# Models are static and rot faster than bananas 🍌

---

background-color: #e69138
class: center, middle, white

# Features that you develop locally might not be available in real time

???

This is more subtle. Sometimes it isn't possible to recreate features for past instances. If you don't store information about an instance at every point in time, then you can't reproduce its true state during your feature extraction process. In other words, you need a solid data engineering team that cares about time.

---

class: center, middle

> It is, as far as he knows, the only way of coming downstairs, but sometimes he feels that there really is another way, if only he could stop bumping for a moment and think of it.

???

And just like Winnie the Pooh, we're spending so much time banging our heads that we can't stop and think of a better way of doing things.
---

background-color: #607bd4
class: middle, white

## Online learning

.bigbullets[
- Data comes from a stream
- Models learn one observation at a time
- Observations do not have to be stored
- Features and labels are dynamic
]

---

background-color: #FF7F50
class: middle, white

## Different names, same thing

.bigbullets[
- Online learning
- Incremental learning
- Sequential learning
- Iterative learning
- Out-of-core learning
]

---

background-color: #008080
class: middle, white

## Applications

.bigbullets[
- Time series forecasting
- Spam filtering
- Recommender systems
- Ad placement
- Internet of things
- Basically, anything event-based
]

---

class: center, middle

### Online learning in a nutshell

---

background-color: #e66868ff
class: middle, white

## Why is batch learning so popular?

.bigbullets[
- Taught at university
- (Bad) habits
- Hype
- Kaggle
- Library availability
- Higher accuracy
]

---

class: middle

.bullets[
- Python library for doing online learning
- API heavily inspired by `sklearn` and easy to pick up
- Focus on feature extraction, not just learning
- First commit in January 2019
- Version `0.1.0` released earlier this week
]

---

### Observations

Representing an observation with a `dict` is natural:

```python
x = {
    'date': dt.datetime(2019, 4, 22),
    'price': 42.95,
    'shop': 'Ikea'
}
```

- Values can be of any type
- Feature names can be used instead of array indexes
- Python's standard library plays nicely with `dict`s

---

### Targets

A target's type depends on the context:

```python
# Regression
y = 42

# Binary classification
y = True

# Multi-class classification
y = 'setosa'

# Multi-output regression
y = {'height': 29.7, 'width': 21}
```

---

### Streaming data

```python
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

for x, y in X_y:
    print(x, y)
```

- `X_y` is a **generator** and consumes a tiny amount of memory
- The point is that we want to **handle data points one by one**
- The source depends on your use case (a Kafka topic, a CSV file, HTTP requests, etc.)

---

### Training with `fit_one`

```python
from creme import linear_model
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()

for x, y in X_y:
    model.fit_one(x, y)
```

Every `creme` estimator has a `fit_one` method

---

### Predicting with `predict_one`

```python
from creme import linear_model
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()

for x, y in X_y:
*    y_pred_before = model.predict_one(x)
    model.fit_one(x, y)
*    y_pred_after = model.predict_one(x)
```

- Classifiers also have a `predict_proba_one` method
- Transformers have a `transform_one` method
- Training and predicting phases are interleaved

---

### Monitoring performance

```python
from creme import linear_model
from creme import metrics
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()
*metric = metrics.Accuracy()

for x, y in X_y:
    y_pred = model.predict_one(x)
    model.fit_one(x, y)
*    metric.update(y, y_pred)

print(metric)
```

A validation score is available for free! No need to do any cross-validation.

You can also use `online_score` from the `model_selection` module, as sketched on the next slide.
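---

### Monitoring performance with `online_score`

The manual loop above can be collapsed into a single call. A minimal sketch, assuming `online_score` takes the data stream, the model and the metric, and returns the final metric:

```python
from creme import linear_model
from creme import metrics
from creme import model_selection
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()
metric = metrics.Accuracy()

# Interleaves predict_one and fit_one over the stream,
# updating the metric after each observation
score = model_selection.online_score(X_y, model, metric)

print(score)
```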
---

### Composing estimators is easy

```python
from creme import linear_model
from creme import metrics
from creme import preprocessing
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

scale = preprocessing.StandardScaler()
lin_reg = linear_model.LogisticRegression()
*model = scale | lin_reg  # Pipeline shorthand

metric = metrics.Accuracy()

for x, y in X_y:
    y_pred = model.predict_one(x)
    model.fit_one(x, y)
    metric.update(y, y_pred)

print(metric)
```

---

### `creme`'s current modules

.left-column[
- `cluster`
- `compat`
- `compose`
- `datasets`
- `dummy`
- `ensemble`
- `feature_extraction`
- `feature_selection`
- `impute`
- `linear_model`
]

.right-column[
- `model_selection`
- `multiclass`
- `naive_bayes`
- `optim`
- `preprocessing`
- `reco`
- `stats`
- `stream`
- `tree`
- `utils`
]

---

### Online mean

For every incoming $x$, do:

1. $n = n + 1$
2. $\mu\_{i+1} = \mu\_{i} + \frac{x - \mu\_{i}}{n}$

```python
>>> mean = creme.stats.Mean()

>>> mean.update(5)
>>> mean.get()
5

>>> mean.update(10)
>>> mean.get()
7.5
```

---

### Online variance

For every incoming $x$, do:

1. $n = n + 1$
2. $\mu\_{i+1} = \mu\_{i} + \frac{x - \mu\_{i}}{n}$
3. $s\_{i+1} = s\_i + (x - \mu\_{i}) \times (x - \mu\_{i+1})$ (running sum of squares)
4. $\sigma\_{i+1}^2 = \frac{s\_{i+1}}{n}$

```python
>>> variance = creme.stats.Variance()

>>> X = [2, 3, 4, 5]

>>> for x in X:
...     variance.update(x)

>>> variance.get()
1.25

>>> numpy.var(X)
1.25
```

???

This is called Welford's algorithm; it can be extended to skewness and kurtosis.

---

### Standard scaling

With the running mean and variance, we can scale incoming data so that it has zero mean and unit variance.

```python
>>> scaler = creme.preprocessing.StandardScaler()

>>> for x in [2, 3, 4, 5]:
...     print(x, 'becomes', scaler.fit_one({'x': x})['x'])
2 becomes 0.0
3 becomes 0.9999999999999996
4 becomes 1.224744871391589
5 becomes 1.3416407864998738
```

---

### Linear regression (1)

The model is $y_t = \langle w_t, x_t \rangle + b_t$.

The weights $w_t$ can be learnt with any online gradient descent algorithm, for example:

- Stochastic gradient descent (SGD)
- Adam
- RMSProp
- Follow the Regularized Leader (FTRL)

```python
from creme import linear_model
from creme import optim

lin_reg = linear_model.LinearRegression(
    optimizer=optim.Adam(lr=0.01)
)
```

---

### Linear regression (2)

The intercept term $b_t$ is difficult to learn.

Some people (Léon Bottou, scikit-learn) suggest using a lower learning rate for the intercept than for the weights (a heuristic, but it works okay).

`creme` can instead update the intercept with any running statistic from the `creme.stats` module, which is a powerful trick.

```python
from creme import linear_model
from creme import optim
from creme import stats

lin_reg = linear_model.LinearRegression(
    optimizer=optim.Adam(lr=0.01),
    intercept=stats.RollingMean(42)
)
```

---

### Online aggregations

```python
>>> import creme

>>> X = [
...     {'place': 'Taco Bell', 'revenue': 42},
...     {'place': 'Burger King', 'revenue': 16},
...     {'place': 'Burger King', 'revenue': 24},
...     {'place': 'Taco Bell', 'revenue': 58}
... ]

>>> agg = creme.feature_extraction.Agg(
...     on='revenue',
...     by='place',
...     how=creme.stats.Mean()
... )

>>> for x in X:
...     print(agg.fit_one(x).transform_one(x))
{'revenue_mean_by_place': 42.0}
{'revenue_mean_by_place': 16.0}
{'revenue_mean_by_place': 20.0}
{'revenue_mean_by_place': 50.0}
```

---

### Bagging (1)

Bagging is a popular and simple ensemble algorithm:

1. Pick $m$ base models (usually identical copies)
2. Train each model on a sample drawn with replacement
3. Average the predictions of each model on the test set

Each observation is sampled $K$ times, where $K$ follows a binomial distribution:

$$P(K=k) = {n \choose k} \times \left(\frac{1}{n}\right)^k \times \left(1 - \frac{1}{n}\right)^{n-k}$$

As $n$ grows towards infinity, $K$ can be approximated by a Poisson(1) distribution:

$$P(K=k) \approx \frac{e^{-1}}{k!}$$
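---

### Bagging: checking the Poisson approximation

A quick numerical sanity check of the approximation above, using only the standard library (an illustrative sketch, not part of `creme`):

```python
import math

n = 10_000  # a large number of observations

for k in range(4):
    # Exact Binomial(n, 1/n) probability of being sampled k times
    binom = math.comb(n, k) * (1 / n) ** k * (1 - 1 / n) ** (n - k)
    # Poisson(1) approximation
    poisson = math.exp(-1) / math.factorial(k)
    print(k, round(binom, 5), round(poisson, 5))
```

Even for moderate $n$, the two distributions are already very close.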
---

### Bagging (2)

`ensemble.BaggingClassifier` is very simple:

```python
def fit_one(self, x, y):

    for estimator in self.estimators:
        for _ in range(self.rng.poisson(1)):
            estimator.fit_one(x, y)

    return self

def predict_proba_one(self, x):
    y_pred = statistics.mean(
        estimator.predict_proba_one(x)[True]
        for estimator in self.estimators
    )
    return {True: y_pred, False: 1 - y_pred}
```

---

### League of Legends match duration forecasting (1)

---

### League of Legends match duration forecasting (2)

---

### Architecture

---

class: middle

### Django model

```python
from django.db import models
from picklefield.fields import PickledObjectField


class CremeModel(models.Model):
    name = models.TextField(unique=True)
    pipeline = PickledObjectField()

    class Meta:
        db_table = 't_models'
        verbose_name_plural = 'models'

    def fit_one(self, x, y):
        self.pipeline.fit_one(x, y)
        return self

    def predict_one(self, x):
        return self.pipeline.predict_one(x)
```

---

class: middle

### Code for predicting

```python
import datetime as dt

match = fetch_match(match_id)

model = models.CremeModel.objects.get(name='My awesome model')

predicted_duration = model.predict_one(match.raw_info)  # in seconds

match.predicted_ended_at = match.started_at + dt.timedelta(seconds=predicted_duration)
match.predicted_by = model
match.save()
```

---

class: middle

### Code for training

When the match ends, it has a `true_duration` property.

```python
model = models.CremeModel.objects.get(id=match.predicted_by.id)

model.fit_one(
    x=match.raw_info,
    y=match.true_duration.seconds
)

model.save()
```

---

class: middle

### Code for calculating performance

Just some Django black magic.

```python
duration = ExpressionWrapper(
    Func(F('predicted_ended_at') - F('ended_at'), function='ABS'),
    output_field=fields.DurationField()
)

agg = models.Match.objects.exclude(ended_at__isnull=True)\
                          .annotate(duration=duration)\
                          .aggregate(Avg('duration'))

avg_error = agg['duration__avg']
```

---

### Benefits of online learning

.bullets[
- No need to schedule model training
- Easy to debug and monitor
- You're not "far" from production
- Way more fun than batch learning
]

---

### Future work

.left-column[
.bullets[
- Decision trees (nearly there!)
- Gradient boosting
- Bayesian linear models
- More feature extraction
- More models
- More benchmarks
- Many issues [on GitHub](https://github.com/creme-ml/creme/issues)
]
]

.right-column[
]

---

### If you want to contribute

- [creme-ml.github.io](https://creme-ml.github.io/)
- [github.com/creme-ml](https://github.com/creme-ml/)
- You can shoot emails to [maxhalford25@gmail.com](mailto:maxhalford25@gmail.com)
- Get in touch if you want to try `creme` and would like advice
- Spread the word!

---

class: center, middle

# Thanks for listening!

.left-column[
]

.right-column[
]