class: center, middle

## Incremental machine learning: from concepts to practice

### Max Halford

#### May 28, 2019

#### Toulouse Data Science Meetup

???

Hello!

---

### Outline

.bullets[
1. You're doing machine learning the wrong way 😱
2. Cool kids do online learning 😎
3. Introducing `creme`, a Python lib for online learning 🐍
4. Bike stations forecasting demo 🚲 🔮
]

---

class: center, middle

### PyData Amsterdam 2019 🎉 🇳🇱 🧀

---

### Batch learning

.bullets[
1. Collect features $X$ and labels $Y$
2. Train a model on $(X, Y)$
3. Save the model somewhere
4. Load the model to make predictions
]

With code:

```python
>>> model.fit(X_train, y_train)
>>> dump(model, 'model.json')
>>> model = load('model.json')
>>> y_pred = model.predict(X_test)
```

---

class: center, middle

### Batch machine learning in production

---

background-color: #2ac380
class: center, middle, white

# Models have to be retrained from scratch with new data ⏰⏰⏰

---

background-color: #663399
class: center, middle, white

# Models need increasing amounts of power 🔌

---

background-color: #1f282d
class: center, middle, white

# Models are static and "rot" faster than bananas 🍌

---

background-color: #e69138
class: center, middle, white

# Models that work locally don't always work in production 😭

---

class: center, middle

> It is, as far as he knows, the only way of coming downstairs, but sometimes he feels that there really is another way, if only he could stop bumping for a moment and think of it.

???

And just like Winnie the Pooh, we're spending too much time banging our heads to be able to think of a better way of doing things.

---

background-color: #607bd4
class: middle, white

## Online learning

.bigbullets[
- Data comes from a stream
- Models learn one observation at a time
- Features and labels are dynamic
]

---

## Everything changes 🔥

---

background-color: #008080
class: middle, white

## Different names, same thing 🤷

.bigbullets[
- Online learning
- Incremental learning
- Sequential learning
- Iterative learning
- Continuous learning
- Out-of-core learning
]

---

background-color: #FF7F50
class: middle, white

## Applications

.bigbullets[
- Time series forecasting
- Spam filtering
- Recommender systems
- Ad placement
- Internet of things
- Basically, anything event-based
]

---

class: center, middle

### Online learning in a nutshell
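---

class: middle

### The online learning loop, sketched

A toy sketch of the predict-then-learn pattern in plain Python. `MajorityClassifier` and its methods are made up for illustration; no particular library's API is implied:

```python
from collections import Counter
import random

class MajorityClassifier:
    """Toy online model: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def fit_one(self, x, y):
        self.counts[y] += 1

model = MajorityClassifier()

# Simulated stream of emails: the model sees one observation at a time
for _ in range(1000):
    x = {'length': random.randint(10, 500)}
    y = random.random() < 0.3          # e.g. True = spam
    y_pred = model.predict_one(x)      # predict before the label is revealed
    model.fit_one(x, y)                # then learn from the observation
```

The model never sees the dataset as a whole, so memory usage stays constant no matter how long the stream runs.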
---

background-color: #e66868ff
class: middle, white

## Why is batch learning so popular?

.bigbullets[
- Taught at university 🎓
- (Bad) habits
- Hype
- Kaggle 💯
- Library availability
]

---

class: center, middle

# Questions?

---

class: middle

.bullets[
- Online machine learning library for Python 🐍
- Easy-to-pick-up API inspired by `sklearn`
- Written with production scenarios in mind
- First commit in January 2019
- Version `0.2.0` released yesterday
]

---

#### scikit-learn

```python
from sklearn import datasets
from sklearn import linear_model

X, y = datasets.load_boston(return_X_y=True)
model = linear_model.LinearRegression()
model.fit(X, y)
```

#### creme

```python
from creme import linear_model
from creme import stream
from sklearn import datasets

X_y = stream.iter_sklearn_dataset(datasets.load_boston)
model = linear_model.LinearRegression()

for x, y in X_y:
    model.fit_one(x, y)
```

---

class: middle

### Features

Representing a set of features with a `dict` is natural:

```python
x = {
    'date': dt.datetime(2019, 4, 22),
    'price': 42.95,
    'shop': 'Ikea'
}
```

- Values can be of any type
- Feature names can be used instead of array indexes
- Python's standard library plays nicely with `dict`s

---

class: middle

### Targets

A target's type depends on the context:

```python
# Regression
y = 42

# Binary classification
y = True

# Multi-class classification
y = 'setosa'

# Multi-output regression
y = {'height': 29.7, 'width': 21}
```

---

class: middle

### Streaming data

```python
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

for x, y in X_y:
    print(x, y)
```

- `X_y` is a **generator** and consumes a tiny amount of memory
- The point is that we only need one data point at a time
- The source depends on your use case (CSV file, Kafka consumer, HTTP requests)

---

class: middle

### Training with `fit_one`

```python
from creme import linear_model
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()

for x, y in X_y:
*   model.fit_one(x, y)
```

Every `creme` estimator has a `fit_one` method

---

class: middle

### Predicting with `predict_one`

```python
from creme import linear_model
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()

for x, y in X_y:
*   y_pred = model.predict_one(x)
    model.fit_one(x, y)
```

- Classifiers also have a `predict_proba_one` method
- Transformers have a `transform_one` method
- Training and predicting phases are interleaved

---

class: middle

### Progressive validation 🎯

```python
from creme import linear_model
from creme import metrics
from creme import stream

X_y = stream.iter_csv('some/csv/file.csv')

model = linear_model.LogisticRegression()
metric = metrics.Accuracy()

for x, y in X_y:
    y_pred = model.predict_one(x)
    model.fit_one(x, y)
*   metric.update(y, y_pred)
    print(metric)
```

The validation score is available for free! No need for cross-validation.

You can also use `online_score` from the `model_selection` module.

---

class: middle

### Composing estimators is easy

```python
from creme import compose
from creme import linear_model
from creme import preprocessing

scale = preprocessing.StandardScaler()
lin_reg = linear_model.LogisticRegression()

# You can do this...
model = compose.Pipeline([
    ('scale', scale),
    ('lin_reg', lin_reg)
])

# Or this...
model = scale | lin_reg
```

---

class: center, middle

# Questions?

---

### Online mean

For every incoming $x$, do:

1. $n = n + 1$
2. $\mu\_{i+1} = \mu\_{i} + \frac{x - \mu\_{i}}{n}$

```python
>>> mean = creme.stats.Mean()

>>> mean.update(5)
>>> mean.get()
5

>>> mean.update(10)
>>> mean.get()
7.5
```
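---

### Online mean, from scratch

The update rule above fits in a few lines of plain Python. Here is a from-scratch sketch, for illustration only (this is not `creme`'s actual implementation):

```python
class Mean:
    """Running mean: mu <- mu + (x - mu) / n."""

    def __init__(self):
        self.n = 0
        self.mu = 0.0

    def update(self, x):
        self.n += 1
        self.mu += (x - self.mu) / self.n

    def get(self):
        return self.mu

mean = Mean()
for x in (5, 10):
    mean.update(x)
print(mean.get())  # 7.5
```

This is exactly step 2 above; the variance slide that follows extends the same idea via Welford's algorithm.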
---

### Online variance

For every incoming $x$, do:

1. $n = n + 1$
2. $\mu\_{i+1} = \mu\_{i} + \frac{x - \mu\_{i}}{n}$
3. $s\_{i+1} = s\_i + (x - \mu\_{i}) \times (x - \mu\_{i+1})$
4. $\sigma\_{i+1}^2 = \frac{s\_{i+1}}{n}$

```python
>>> variance = creme.stats.Variance()
>>> X = [2, 3, 4, 5]

>>> for x in X:
...     variance.update(x)

>>> variance.get()
1.25

>>> numpy.var(X)
1.25
```

???

This is called Welford's algorithm; it can be extended to skew and kurtosis

---

### Standard scaling

Using the mean and the variance, we can rescale incoming data.

```python
>>> scaler = creme.preprocessing.StandardScaler()

>>> for x in [2, 3, 4, 5]:
...     features = {'x': x}
...     scaler.fit_one(features)
...     new_x = scaler.transform_one(features)['x']
...     print(f'{x} becomes {new_x}')
2 becomes 0.0
3 becomes 0.9999999999999996
4 becomes 1.224744871391589
5 becomes 1.3416407864998738
```

---

### Linear regression (1)

The model is $y_t = \langle w_t, x_t \rangle + b_t$.

The weights $w_t$ can be learnt with any online gradient descent algorithm, for example:

- Stochastic gradient descent (SGD)
- Adam
- RMSProp
- Follow the Regularized Leader (FTRL)

```python
from creme import linear_model
from creme import optim

lin_reg = linear_model.LinearRegression(
    optimizer=optim.Adam(lr=0.01)
)
```

---

### Linear regression (2)

Some people (Léon Bottou, scikit-learn) suggest using a lower learning rate for the intercept than for the weights (a heuristic, but it works)

`creme` can use any running statistic from the `creme.stats` module as the intercept, which is a powerful trick

```python
from creme import linear_model
from creme import optim
from creme import stats

lin_reg = linear_model.LinearRegression(
    optimizer=optim.Adam(lr=0.01),
    intercept=stats.RollingMean(42)
)
```

---

### Online aggregations

```python
>>> import creme

>>> X = [
...     {'meal': '🍕', 'sales': 42},
...     {'meal': '🍔', 'sales': 16},
...     {'meal': '🍔', 'sales': 24},
...     {'meal': '🍕', 'sales': 58}
... ]

>>> agg = creme.feature_extraction.Agg(
...     on='sales',
...     by='meal',
...     how=creme.stats.Mean()
... )

>>> for x in X:
...     print(agg.fit_one(x).transform_one(x))
{'sales_mean_by_meal': 42.0}
{'sales_mean_by_meal': 16.0}
{'sales_mean_by_meal': 20.0}
{'sales_mean_by_meal': 50.0}
```

---

### Bagging (1)

In bootstrap sampling, each observation is sampled $K$ times, where $K$ follows a binomial distribution:

$$P(K=k) = {n \choose k} \times (\frac{1}{n})^k \times (1 - \frac{1}{n})^{n-k}$$

As $n$ grows towards infinity, $K$ can be approximated by a Poisson(1) distribution:

$$P(K=k) \approx \frac{e^{-1}}{k!}$$

This leads to a simple and efficient online algorithm.

---

### Bagging (2)

`ensemble.BaggingClassifier` is very simple:

```python
def fit_one(self, x, y):
    for estimator in self.estimators:
        for _ in range(self.rng.poisson(1)):
            estimator.fit_one(x, y)
    return self

def predict_proba_one(self, x):
    y_pred = statistics.mean(
        estimator.predict_proba_one(x)[True]
        for estimator in self.estimators
    )
    return {True: y_pred, False: 1 - y_pred}
```
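---

### Bagging (3): checking the approximation

As a quick numerical sanity check of the Poisson(1) approximation, we can compare empirical frequencies. This is a sketch added for illustration; it assumes `numpy` is available:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000  # size of the bootstrap sample

# How often does a given observation get picked k times?
binomial = rng.binomial(n=n, p=1 / n, size=100_000)
poisson = rng.poisson(lam=1, size=100_000)

for k in range(4):
    print(
        f'P(K={k}): binomial ~ {np.mean(binomial == k):.4f}, '
        f'Poisson(1) ~ {np.mean(poisson == k):.4f}'
    )
```

The two columns agree to a few decimal places; for $k = 0$ both are close to $e^{-1} \approx 0.368$.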
---

### Decision trees 🌳

- A version of Hoeffding trees is being implemented
- Basic idea:
  - Start with a single leaf 🍃
  - Find the leaf where an observation belongs 🔍
  - Update the leaf's sufficient statistics 📊
  - Measure the information gain every so often 🔬
  - Split when the information gain is good enough 🎉
- Mondrian trees 👨‍🎨 are another possibility, but they only work with continuous attributes

---

class: center, middle

# Questions?

---

### `creme`'s current modules

`cluster`, `compat`, `compose`, `datasets`, `dummy`, `ensemble`, `feature_extraction`, `feature_selection`, `impute`, `linear_model`, `model_selection`, `multiclass`, `naive_bayes`, `optim`, `plot`, `preprocessing`, `proba`, `reco`, `stats`, `stream`, `tree`, `utils`

---

### Cool stuff in `creme` we skipped 😢

.bullets[
- Clustering
- Factorization machines
- Feature selection
- Passive-aggressive models
- Recommender systems
- Histograms
- Skyline queries
- Fourier transforms
- Imputation
- Naive Bayes
]

---

### Alternative frameworks

---

### Benefits of online learning

.bullets[
- No need to schedule model training
- Easy to monitor
- You're very close to production
- Way more fun than batch learning
]

---

### Current work

.bullets[
- Decision trees (nearly there)
- Gradient boosting (easy)
- Bayesian linear models (part of my PhD)
- Latent Dirichlet Allocation (ask Raphael)
- Many issues [on GitHub](https://github.com/creme-ml/creme/issues)
]

---

### What next?

- [creme-ml.github.io](https://creme-ml.github.io/)
- [github.com/creme-ml](https://github.com/creme-ml/)
- You can send emails to [maxhalford25@gmail.com](mailto:maxhalford25@gmail.com)
- Get in touch if you want help and/or advice
- Starring us on GitHub helps a lot 🌟

---

class: center, middle

# Thanks for listening!