An introduction to symbolic regression
Max Halford - PhD student IRIT/IMT
Toulouse Data Science Meetup - December 2017
.center[
.left-column[]
.right-column[]
]
---
layout: true
# Symbolic regression
---
## Quick overview
- The goal is to evolve "programs" with selection, mutation, and crossover
- Selection keeps programs that perform well
- Mutation changes a piece of the program
- Crossover combines two programs
---

---
## Example programs

---
## Kaggle Titanic top 1% 🚢
2 years ago [scirpus](https://www.kaggle.com/scirpus) posted a [solution](https://www.kaggle.com/scirpus/genetic-programming-lb-0-88) to the [Kaggle Titanic competiton](https://www.kaggle.com/c/titanic)
```python
y_raw = ((np.minimum(((((0.058823499828577 + X['Sex']) - np.cos((X['Pclass'] / 2.0))) * 2.0)), ((0.885868))) * 2.0) +np.maximum(((X['SibSp'] - 2.409090042114258)), (-(np.minimum((X['Sex']), (np.sin(X['Parch']))) * X['Pclass']))) +(0.138462007045746 * ((np.minimum((X['Sex']), (((X['Parch'] / 2.0) / 2.0))) * X['Age']) - X['Cabin'])) +np.minimum(((np.sin((X['Parch'] * ((X['Fare'] - 0.720430016517639) * 2.0))) * 2.0)), ((X['SibSp'] / 2.0))) +np.maximum((np.minimum((-np.cos(X['Embarked'])), (0.138462007045746))), (np.sin(((X['Cabin'] - X['Fare']) * 2.0)))) +-np.minimum(((((X['Age'] * X['Parch']) * X['Embarked']) + X['Parch'])), (np.sin(X['Pclass']))) +np.minimum((X['Sex']), ((np.sin(-(X['Fare'] * np.cos((X['Fare'] * 1.630429983139038)))) / 2.0))) +np.minimum(((0.230145)), (np.sin(np.minimum((((67.0 / 2.0) * np.sin(X['Fare']))), (0.31830988618379069))))) +np.sin((np.sin(X['Cabin']) * (np.sin((12.6275)) * np.maximum((X['Age']), (X['Fare']))))) +np.sin(((np.minimum((X['Fare']), ((X['Cabin'] * X['Embarked']))) / 2.0) * -X['Fare'])) +np.minimum((((2.675679922103882 * X['SibSp']) * np.sin(((96) * np.sin(X['Cabin']))))), (X['Parch'])) +np.sin(np.sin((np.maximum((np.minimum((X['Age']), (X['Cabin']))), ((X['Fare'] * 0.31830988618379069))) * X['Cabin']))) +np.maximum((np.sin(((12.4148) * (X['Age'] / 2.0)))), (np.sin((-3.0 * X['Cabin'])))) +(np.minimum((np.sin((((np.sin(((X['Fare'] * 2.0) * 2.0)) * 2.0) * 2.0) * 2.0))), (X['SibSp'])) / 2.0) +((X['Sex'] - X['SibSp']) * (np.cos(((X['Embarked'] - 0.730768978595734) + X['Age'])) / 2.0)) +((np.sin(X['Cabin']) / 2.0) - (np.cos(np.minimum((X['Age']), (X['Embarked']))) * np.sin(X['Embarked']))) +np.minimum((0.31830988618379069), ((X['Sex'] * (2.212120056152344 * (0.720430016517639 - np.sin((X['Age'] * 2.0))))))) +(np.minimum((np.cos(X['Fare'])), (np.maximum((np.sin(X['Age'])), (X['Parch'])))) * np.cos((X['Fare'] / 2.0))) +np.sin((X['Parch'] * np.minimum(((X['Age'] - 1.5707963267948966)), ((np.cos((X['Pclass'] * 2.0)) / 2.0))))) +(X['Parch'] * (np.sin(((X['Fare'] * (0.623655974864960 * X['Age'])) * 2.0)) / 2.0)) +(0.31830988618379069 * np.cos(np.maximum(((0.602940976619720 * X['Fare'])), ((np.sin(0.720430016517639) * X['Age']))))) +(np.minimum(((X['SibSp'] / 2.0)), (np.sin(((X['Pclass'] - X['Fare']) * X['SibSp'])))) * X['SibSp']) +np.tanh((X['Sex'] * np.sin((5.199999809265137 * np.sin((X['Cabin'] * np.cos(X['Fare']))))))) +(np.minimum((X['Parch']), (X['Sex'])) * np.cos(np.maximum(((np.cos(X['Parch']) + X['Age'])), (3.1415926535897931)))) +(np.minimum((np.tanh(((X['Cabin'] / 2.0) + X['Parch']))), ((X['Sex'] + np.cos(X['Age'])))) / 2.0) +(np.sin((np.sin(X['Sex']) * (np.sin((X['Age'] * X['Pclass'])) * X['Pclass']))) / 2.0) +(X['Sex'] * (np.cos(((X['Sex'] + X['Fare']) * ((8.48635) * (63)))) / 2.0)) +np.minimum((X['Sex']), ((np.cos((X['Age'] * np.tanh(np.sin(np.cos(X['Fare']))))) / 2.0))) +(np.tanh(np.tanh(-np.cos((np.maximum((np.cos(X['Fare'])), (0.094339601695538)) * X['Age'])))) / 2.0) +(np.tanh(np.cos((np.cos(X['Age']) + (X['Age'] + np.minimum((X['Fare']), (X['Age'])))))) / 2.0) +(np.tanh(np.cos((X['Age'] * ((-2.0 + np.sin(X['SibSp'])) + X['Fare'])))) / 2.0) +(np.minimum((((281) - X['Fare'])), (np.sin((np.maximum(((176)), (X['Fare'])) * X['SibSp'])))) * 2.0) +np.sin(((np.maximum((X['Embarked']), (X['Age'])) * 2.0) * (((785) * 3.1415926535897931) * X['Age']))) +np.minimum((X['Sex']), (np.sin(-(np.minimum(((X['Cabin'] / 2.0)), (X['SibSp'])) * (X['Fare'] / 2.0))))) +np.sin(np.sin((X['Cabin'] * (X['Embarked'] + (np.tanh(-X['Age']) + X['Fare']))))) +(np.cos(np.cos(X['Fare'])) * (np.sin((X['Embarked'] - ((734) * X['Fare']))) / 2.0)) +((np.minimum((X['SibSp']), (np.cos(X['Fare']))) * np.cos(X['SibSp'])) * np.sin((X['Age'] / 2.0))) +(np.sin((np.sin((X['SibSp'] * np.cos((X['Fare'] * 2.0)))) + (X['Cabin'] * 2.0))) / 2.0) +(((X['Sex'] * X['SibSp']) * np.sin(np.sin(-(X['Fare'] * X['Cabin'])))) * 2.0) +(np.sin((X['SibSp'] * ((((5.428569793701172 + 67.0) * 2.0) / 2.0) * X['Age']))) / 2.0) +(X['Pclass'] * (np.sin(((X['Embarked'] * X['Cabin']) * (X['Age'] - (1.07241)))) / 2.0)) +(np.cos(((((-X['SibSp'] + X['Age']) + X['Parch']) * X['Embarked']) / 2.0)) / 2.0) +(0.31830988618379069 * np.sin(((X['Age'] * ((X['Embarked'] * np.sin(X['Fare'])) * 2.0)) * 2.0))) +((np.minimum(((X['Age'] * 0.058823499828577)), (X['Sex'])) - 0.63661977236758138) * np.tanh(np.sin(X['Pclass']))) +-np.minimum(((np.cos(((727) * ((X['Fare'] + X['Parch']) * 2.0))) / 2.0)), (X['Fare'])) +(np.minimum((np.cos(X['Fare'])), (X['SibSp'])) * np.minimum((np.sin(X['Parch'])), (np.cos((X['Embarked'] * 2.0))))) +(np.minimum((((X['Fare'] / 2.0) - 2.675679922103882)), (0.138462007045746)) * np.sin((1.5707963267948966 * X['Age']))) +np.minimum(((0.0821533)), (((np.sin(X['Fare']) + X['Embarked']) - np.cos((X['Age'] * (9.89287)))))))
y_pred = 1 / (1 + np.exp(-y_raw))
```
0.88516 (top 1% as of today) on the public leaderboard 😵
---
## Types of nodes
- Constants
- Variables
- Functions
Huge search space because the shape of the model is included in the search space 🙊
---
## Algorithm (high-level) ➰
```python
programs = make_random_programs()
for i in range(generations):
evaluate(programs)
new_programs = select(programs)
crossover(new_programs)
mutate(new_programs)
programs = new_programs
```
---
## Mutation
---
## Crossover 💏
---
## Evaluation 💯
---
## Selection
---
## Initialization
---
## Pros 😺
- Flexible model
- Built-in feature selection
- Can optimise non-differentiable metrics
- Useful for stacking
- Makes you look cool 😎
---
## Cons 😿
- Black box model
- No mathematical foundation
- Requires a lot of CPU power
- Non-deterministic and volatile
---
layout: true
.center[]
---
## Enter XGP 🎉
- Written in [Go](https://golang.org/)
- Optimization done with [gago](https://github.com/MaxHalford/gago)
- SIMD operations thanks to [gonum](https://www.gonum.org/) ⚡
- Readable code 📖
- Very customizable (with sensible defaults!)
- Command line interface (CLI) 💻
- Python bindings (scikit-learn API) 🐍
---
## Command line interface (CLI) 💻
```sh
$ xgp fit train.csv
```
```sh
$ xgp predict test.csv
```
---
## Python package 🐍
### Regression 📈
```python
model = xgp.XGPRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```
---
## Python package 🐍
### Feature extraction 🔬
```py
model = xgp.XGPTransformer()
model.fit(X_train, y_train)
train_gp_features = model.predict(X_train)
test_gp_features = model.predict(X_test)
```
---
## Future work 🔮
- Binary and multi-class classification
- Boosting (promising)
- Caching to handle large datasets
- More bindings (R, Ruby, ...)
- Extensive testing
- Documentation and examples
---
layout: false
class: center, middle
# Thanks! ✌
[github.com/MaxHalford](https://github.com/MaxHalford/)