During my undergraduate internship in 2015 I started a side project called OpenBikes. The idea was to visualize and analyze bike sharing across multiple cities. Axel Bellec joined me, and in 2016 we won a national open data competition. Since then we haven’t pursued anything major; instead, we use OpenBikes to try out technologies and to apply concepts we learn at university and online.

Before the 2016 summer holidays, one of my professors, Aurélien Garivier, mentioned that he was considering using our data for a Kaggle-like competition between some statistics curriculums in France. Near the end of the summer, I sat down with a group of professors and we decided upon a format for the so-called “Challenge”. The general idea was to provide student teams with historical data on multiple bike stations and ask them to do some forecasting, which we would then score against a secret ground truth. The whole thing lasted from the 5th of October 2016 till the 26th of January 2017, when the best team was crowned.

The challenge was split into two phases. During the first phase, the teams were provided with data spanning from the 1st of April until the 5th of October at 10 AM. The data contained updates on the number of bikes at each station, the geographical position of the stations, and the weather in each city. The teams were asked to forecast the number of bikes at 30 stations (10 each from Toulouse, Lyon and Paris) for 10 fixed timesteps ranging from the 5th of October at 10 AM until the 9th of October. Each team had access to an account page where they could deposit their submissions, which were then automatically scored. A public leaderboard was available on the homepage of the website Axel and I built.

During the second part of the challenge, which lasted from the 12th of January 2017 until the 20th of January 2017, the teams were provided with a new dataset containing similar data to the first part, except that Lyon had been swapped out for New York. The data went from the 1st of April 2016 until the 11th of January 2017. The timesteps to predict were the same - both test sets went from a Wednesday till a Sunday. The teams did not get any feedback on their submissions during the second part; they were scored blindly, based on each team’s last submission only.

In each part of the challenge, the metric used to score the teams was the mean absolute error (MAE) between their submissions and the truth. The MAE makes it possible to say things such as “team A was, on average, 3.2 bikes off target”.
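
Concretely, with $y_i$ the true number of bikes and $\hat{y}_i$ a team’s forecast for each of the $n = 30 \times 10 = 300$ (station, timestep) pairs:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left\lvert \hat{y}_i - y_i \right\rvert$$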

Technical notes

Axel and I had been collecting freely available bike sharing data since October 2015. To do this we put in place a homemade crawler which would interrogate various APIs and aggregate their data into a single format. We stored the number of bikes at each station at different timesteps in MongoDB. The metadata concerning the cities and the bike stations was stored in PostgreSQL. We also exposed an API so that our data could be used in a hypothetical mobile app. We deployed our crawler/API on a $20 DigitalOcean server. The glue language was Python. The whole thing is available on GitHub.
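
In essence, the crawler boiled down to polling each provider and normalizing the payload before inserting it into MongoDB. The sketch below illustrates the idea; the endpoint and field names are made up, and the real per-city parsers live in the GitHub repo.

```python
import datetime as dt

import pymongo
import requests

# Hypothetical endpoint and field names -- each real provider needed its own parser
API_URL = 'https://api.example-bikes.com/{city}/stations'

client = pymongo.MongoClient()
updates = client.openbikes.updates

def crawl(city):
    """Fetch the latest station statuses for a city and store them in a common format."""
    stations = requests.get(API_URL.format(city=city)).json()
    for station in stations:
        updates.insert_one({
            'city': city,
            'station': station['name'],
            'bikes': station['available_bikes'],
            'moment': dt.datetime.utcnow(),
        })

for city in ('toulouse', 'lyon', 'paris'):
    crawl(city)
```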

To host the challenge I wrote a simple Django application (my first time!) which Axel kindly deployed on the same server as the crawler. The application used SQLite as its database backend, partly because I wanted to try it out in production, but also because anything more powerful was unnecessary. Moreover, SQLite stores its data in a single *.db file which can easily be transferred for doing some descriptive statistics. Again, the code is available on GitHub.
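
This is what makes ad hoc analysis painless: copy the *.db file off the server and point pandas at it. A minimal sketch, assuming a submissions table with hypothetical column names:

```python
import sqlite3

import pandas as pd

# The whole database is a single file that can simply be copied off the server
conn = sqlite3.connect('challenge.db')

# 'submissions' and its columns are hypothetical names, for illustration
submissions = pd.read_sql_query(
    'SELECT team, score, submitted_at FROM submissions', conn)

# For instance, the best score achieved by each team
print(submissions.groupby('team')['score'].min().sort_values())
```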

Results

The turnout was quite high considering the fact that we didn’t put that much effort into it. All in all, 8 curriculums and 50 teams took part in the challenge. A total of 947 valid submissions were made, which works out to an average of 19 submissions per team and roughly 7 submissions a day - our server could easily handle that! Of course, the rate at which teams submitted wasn’t uniform through time, as can be seen on the following chart.

submissions_per_day
New Year resolutions seem to have kicked in

Philippe Besse suggested looking into the relationship between the number of submissions and the best score per team. The idea was to see if any overfitting had occurred - in other words, whether the best scores were obtained by making many submissions with small adjustments. As can be seen on the following chart, an expected phenomenon arises: the teams with the best scores are usually the ones that submitted more than the others. Interestingly, this becomes less pronounced the more a team submits, which basically means that improving one’s score becomes harder and harder - this is always the case in data science competitions. To illustrate this phenomenon I fitted an exponential curve to the data.

score_vs_submissions
An exponential trend can vaguely be seen
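
The fit itself was nothing fancy. Here is a minimal sketch with scipy, using made-up points in place of the real (submissions, best score) pairs:

```python
import numpy as np
from scipy import optimize

def exponential(x, a, b, c):
    """The score decays towards a floor c as the number of submissions x grows."""
    return a * np.exp(-b * x) + c

# Made-up points standing in for the real (submissions, best score) pairs
submissions = np.array([1, 5, 10, 20, 40, 80, 160])
best_scores = np.array([6.5, 5.2, 4.6, 4.1, 3.9, 3.7, 3.6])

params, _ = optimize.curve_fit(exponential, submissions, best_scores, p0=(3, 0.05, 3.5))
print(params)  # fitted a, b and the floor c
```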

The final public leaderboard (for the first part, that is) was as follows.

| Team name | Curriculum | Best score | Number of submissions |
|---|---|---|---|
| Dream Team | ISAE - SUPAERO | 2.7944230769230773 | 49 |
| Mr Nobody | Université de Bordeaux | 3.49 | 14 |
| Oh l'équipe | Université de Bordeaux | 3.49 | 45 |
| PrédiX | Ecole Polytechnique | 3.5402333333332323 | 47 |
| Louison Bobet | StatEco - Toulouse School of Economics | 3.61 | 61 |
| Armstrong | GMM - INSA | 3.62 | 35 |
| OpenBikes | CMISID - Université Paul Sabatier | 3.62930918737076 | 7 |
| Ravenclaw | Université de Bordeaux | 3.67 | 64 |
| LA ROUE ARRIÈRE | CMISID - Université Paul Sabatier | 3.679264355300685 | 6 |
| WeLoveTheHail | GMM - INSA | 3.703333333333333 | 5 |
| GMMerckx | GMM - INSA | 3.7266666666666666 | 20 |
| TEAM_SKY | Université de Bordeaux | 3.7333333333333334 | 17 |
| zoomzoom | Université de Bordeaux | 3.7333333333333334 | 21 |
| KAMEHAMEHA | Université de Bordeaux | 3.7333333333333334 | 21 |
| Tricycles | GMM - INSA | 3.743333333333333 | 102 |
| AfricanCommunity | CMISID - Université Paul Sabatier | 3.7480411895507957 | 45 |
| Avermaert | Université de Bordeaux | 3.75 | 21 |
| ZiZou98 <3 | GMM - INSA | 3.776666666666667 | 54 |
| LesCyclosTouristes | CMISID - Université Paul Sabatier | 3.8066666666666666 | 33 |
| Ul-Team | Université de Bordeaux | 3.8619785774274002 | 6 |
| Les Grosses Données | MAPI3 - Université Paul Sabatier | 3.866305333958716 | 6 |
| H2O | GMM - INSA | 3.8833333333333333 | 14 |
| RYJA Team | CMISID - Université Paul Sabatier | 3.9366666666666665 | 3 |
| NSS | Université de Bordeaux | 3.9476790826017845 | 8 |
| Pas le temps de niaiser | CMISID - Université Paul Sabatier | 3.9486790826017866 | 1 |
| TheCrazyInsaneFlyingMonkeySpaceInvaders | StatEco - Toulouse School of Economics | 4.003399297813541 | 14 |
| Jul saint-Jean-la-Puenta | StatEco - Toulouse School of Economics | 4.003399297813541 | 13 |
| Four and one | StatEco - Toulouse School of Economics | 4.04 | 24 |
| test | CMISID - Université Paul Sabatier | 4.088626133087001 | 38 |
| Pedalo | GMM - INSA | 4.088626133087581 | 1 |
| alela | Université de Bordeaux | 4.173333333333333 | 2 |
| MoAx | GMM - INSA | 4.199905182463245 | 20 |
| Pas de Pau | Université de Pau | 4.25912069022148 | 2 |
| Le Gruppetto | StatEco - Toulouse School of Economics | 4.333333333333333 | 8 |
| Lolilol | Université de Bordeaux | 4.338140757878622 | 1 |
| DataScientist2017 | StatEco - Toulouse School of Economics | 5.003333333333333 | 11 |
| SRAM | MAPI3 - Université Paul Sabatier | 5.10472813815054 | 1 |
| Outliers | MAPI3 - Université Paul Sabatier | 5.3133333333333335 | 28 |
| Jean Didier Vélo ♯♯ | MAPI3 - Université Paul Sabatier | 5.3133333333333335 | 5 |
| Velouse | CMISID - Université Paul Sabatier | 5.343147845575032 | 12 |
| TEAM NNBJ | CMISID - Université Paul Sabatier | 5.343147845575032 | 1 |
|  | MAPI3 - Université Paul Sabatier | 5.49 | 5 |
| JMEG | MAPI3 - Université Paul Sabatier | 5.783141628450403 | 7 |
| player | CMISID - Université Paul Sabatier | 6.3133333333333335 | 21 |
| TSE-BigData | StatEco - Toulouse School of Economics | 6.536927792628606 | 5 |
| kangou | Université de Bordeaux | 6.826666666666667 | 16 |
| On aura votre Pau Supaéro | Université de Pau | 7.306310702178721 | 1 |
| Les Pédales | StatEco - Toulouse School of Economics | 7.803333333333334 | 1 |
| BIKES FINDERS | StatEco - Toulouse School of Economics | 8.07 | 1 |
| Université Bordeaux Enseignants | ISAE - SUPAERO | 8.59 | 3 |
| Pedalo | GMM - INSA | 8.876666666666667 | 1 |

Congratulations to team “Dream Team” for winning, by far, the first part of the challenge. The rest of the best teams seem to have hit a wall at ~3.6 bikes. This can be seen on the following chart, which shows the best score of the top 10 teams over time.

top_10_progression
Dream Team went from 4 to 2.7 all of a sudden

As for the second part of the challenge (the blindfolded part), here is the final ranking:

| Team name | Curriculum | Best score |
|---|---|---|
| Le Gruppetto | StatEco - Toulouse School of Economics | 3.2229284855662748 |
| OpenBikes | CMISID - Université Paul Sabatier | 3.73651042834827 |
| Louison Bobet | StatEco - Toulouse School of Economics | 3.816666666666667 |
| Mr Nobody | Université de Bordeaux | 3.94 |
| Oh l'équipe | Université de Bordeaux | 3.953333333333333 |
| Ravenclaw | Université de Bordeaux | 4.05 |
| Tricycles | GMM - INSA | 4.24 |
| Four and one | StatEco - Toulouse School of Economics | 4.45 |
| PrédiX | Ecole Polytechnique | 4.523916666666766 |
| WeLoveTheHail | GMM - INSA | 4.596666666666667 |
| ZiZou98 <3 | GMM - INSA | 4.596666666666667 |
| Dream Team | ISAE - SUPAERO | 4.6402666666666725 |
| GMMerckx | GMM - INSA | 4.706091376666668 |
| LA ROUE ARRIÈRE | CMISID - Université Paul Sabatier | 4.711600488605138 |
| Avermaert | Université de Bordeaux | 4.74 |
| TSE-BigData | StatEco - Toulouse School of Economics | 4.754570707888592 |
| Pedalo | GMM - INSA | 4.755557911517161 |
| test | CMISID - Université Paul Sabatier | 4.755557911517161 |
| Armstrong | GMM - INSA | 4.87 |
| H2O | GMM - INSA | 4.93 |
| LesCyclosTouristes | CMISID - Université Paul Sabatier | 4.963333333333333 |
| AfricanCommunity | CMISID - Université Paul Sabatier | 4.977632664232967 |
| TheCrazyInsaneFlyingMonkeySpaceInvaders | StatEco - Toulouse School of Economics | 5.092628340778687 |
| NSS | Université de Bordeaux | 5.2957133684308015 |
| Ul-Team | Université de Bordeaux | 5.374689402650746 |
| Les Pédales | StatEco - Toulouse School of Economics | 5.492804691024131 |
| Les Grosses Données | MAPI3 - Université Paul Sabatier | 5.62947781887139 |
| Velouse | CMISID - Université Paul Sabatier | 5.666666666666668 |
| DataScientist2017 | StatEco - Toulouse School of Economics | 5.756666666666668 |
| JMEG | MAPI3 - Université Paul Sabatier | 5.981138200799902 |
| RYJA Team | CMISID - Université Paul Sabatier | 6.026666666666666 |
| BIKES FINDERS | StatEco - Toulouse School of Economics | 6.076666666666667 |
| TEAM_SKY | Université de Bordeaux | 6.123333333333332 |
| KAMEHAMEHA | Université de Bordeaux | 6.123333333333332 |
| zoomzoom | Université de Bordeaux | 6.123333333333332 |
|  | MAPI3 - Université Paul Sabatier | 6.246666666666667 |
| Outliers | MAPI3 - Université Paul Sabatier | 7.926666666666668 |
| TEAM NNBJ | CMISID - Université Paul Sabatier | 8.243333333333334 |
| MoAx | GMM - INSA | 10.946772841269727 |

Team “Le Gruppetto” is officially the winner of the challenge! The fact that the second part of the competition was blindfolded completely shook up the rankings: it favored teams with robust methods whilst penalizing overfitters. What’s more, “only” 39 teams took part in the second part (50 did in the first one); maybe some teams felt that their ranking wouldn’t change, but the fact is that “Le Gruppetto” were 34th before finishing 1st. It isn’t over till the fat lady sings. The following chart shows the best score per team for both parts of the challenge.

part1_vs_part2_scores
I'm the orange spot overlapped by a cyan one in the bottom left!

Who used what?

Every team was asked to submit their code along with their second-part submission. This was mostly required to make sure no team had cheated by retrieving the data from an API; however, it was also an occasion to see what tools the students were using. Here is a brief summary:

  • 23 teams used R (mostly xgboost, dplyr, gbm, randomForest, caret)
  • 19 teams used random forests
  • 15 teams used Python (mostly pandas and sklearn)
  • 8 teams used some form of averaging (which isn’t really machine learning; see the sketch after this list), 5 went further and used the averages as features
  • 7 teams used gradient boosted trees
  • 4 teams used Jupyter with Python
  • 3 teams used model stacking (they averaged the outputs of their individual models)
  • 3 teams used vanilla linear regression
  • 2 teams averaged the number of bikes in the surrounding stations
  • 2 teams used randomized decision trees (they did fairly well)
  • 2 teams used $k$ nearest neighbours
  • 1 team used RMarkdown
  • 1 team used LASSO regression
  • 1 team used a CART
  • 1 team used a SARIMA process
  • 1 team used a recurrent neural network (me!)
  • 1 team used dynamic time warping
  • 1 team used principal component analysis
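
To make the “averaging” entry above concrete, here is a minimal pandas sketch of that baseline (file and column names are hypothetical): predict, for each station, its historical mean number of bikes at the given weekday and hour.

```python
import pandas as pd

# Hypothetical file and column names: one row per (station, moment) update
updates = pd.read_csv('updates.csv', parse_dates=['moment'])
updates['weekday'] = updates['moment'].dt.weekday
updates['hour'] = updates['moment'].dt.hour

# Historical average number of bikes per (station, weekday, hour)
averages = updates.groupby(['station', 'weekday', 'hour'])['bikes'].mean()

def forecast(station, moment):
    """Predict the historical average for that station, weekday and hour."""
    return averages.loc[(station, moment.weekday(), moment.hour)]

print(forecast('Capitole', pd.Timestamp('2016-10-08 17:00')))
```

The teams that went further used such averages as extra features for a proper model.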

An interesting fact is that 4 of the teams who used Python only did so to prepare their data, and coded their actual model in R.

The winners of the first part used a random forest, dynamic time warping and principal component analysis - their Python code is quite hard to grasp! As for the winners of the second part, they coded in R and looked at the surrounding stations whilst using gradient boosted trees.

Conclusion

I would like to thank all the students and teachers that participated, from the following curriculums:

  • ISAE - SUPAERO
  • Ecole Polytechnique
  • StatEco - Toulouse School of Economics
  • GMM - INSA
  • CMISID - Université Paul Sabatier
  • MAPI3 - Université Paul Sabatier
  • Université de Bordeaux
  • Université de Pau

Personally, I had a great experience organization-wise. I had a few panic attacks whilst preparing the CSV files and updating the website, but on the whole everything went quite smoothly. As a participant my feelings were more mixed; to be totally honest, I wasn’t very inclined to participate. It’s difficult to be both on stage and behind the scenes!

Datasets

I’ve made the datasets that were used during both parts of the challenge - including the true answers - available; click on this link to download them. Edit in 2024: this link is now dead, but you can find a lot of bike sharing data here.

Feel free to email me if you have any questions.