Max Halford
https://maxhalford.github.io/
Recent content on Max Halford
Hugo -- gohugo.io
en-us
maxhalford25@gmail.com (Max Halford)
Sun, 11 Apr 2021 00:00:00 +0000

Reducing the memory footprint of a scikit-learn text classifier
https://maxhalford.github.io/blog/sklearn-text-classifier-memory-footprint-reduction/
Sun, 11 Apr 2021 00:00:00 +0000
Context
This week at Alan I’ve been working on parsing French medical prescriptions. There are three types of prescriptions: lenses, glasses, and pharmaceutical prescriptions. Different information needs to be extracted depending on the prescription type. Therefore, the first step is to classify the prescription. The prescriptions we receive are pictures taken by users with their phone. We run each image through an OCR to obtain a text transcription of the image.

An overview of dataset time travel
https://maxhalford.github.io/blog/dataset-time-travel/
Wed, 07 Apr 2021 00:00:00 +0000
TLDR
You’re a data scientist. The engineers in your company overwrite data in the production database. You want to access overwritten data to train your models. How?
I thought time travel only existed in the movies
You’re probably right, except maybe for this guy.
I want to discuss a concept that’s been on my mind for a while now. I like to call it “dataset time travel” because it has a nice ring to it.

Organising a Kaggle InClass competition with a fairness metric
https://maxhalford.github.io/blog/fairness-competition/
Thu, 21 Jan 2021 00:00:00 +0000
Some context
I co-organised a data science competition during the second half of 2020. This was in fact the 5th edition of the “Défi IA”, which is a recurring event that happens on a yearly basis. It is essentially a supervised machine learning competition for students from French-speaking universities and engineering schools. This year was the first time that Kaggle was used to host the competition. Before that we used a custom platform that I wrote during my student years.

Converting Amazon Textract tables to pandas DataFrames
https://maxhalford.github.io/blog/textract-table-to-pandas/
Thu, 14 Jan 2021 00:00:00 +0000
I’m currently doing a lot of document processing at work. One of my tasks is to extract tables from PDF files. I evaluated Amazon Textract’s table extraction capability as part of this task. It’s very well documented, as is the rest of Textract. I was slightly disappointed by the examples, but nothing serious.
I wanted to write this short blog post to share a piece of code I use to convert tables extracted through Amazon Textract to pandas.

What my PhD was about
https://maxhalford.github.io/blog/phd-about/
Wed, 06 Jan 2021 00:00:00 +0000
I defended my PhD thesis on the 12th of October 2020, exactly 3 years and 11 days after having started it. The title of my PhD is Machine learning for query selectivity estimation in relational databases. I thought it would be worthwhile to summarise what I did. Not sure anyone will read this, but at least I’ll be able to remember what I did when I grow old and senile.

Computing cross-correlations in SQL
https://maxhalford.github.io/blog/sql-cross-correlations/
Tue, 17 Nov 2020 00:00:00 +0000
Introduction
I’m currently working on a problem at work where I have to measure the impact of a growth initiative on a performance metric. Hypothetically, this might answer the following kind of question:
I’ve spent X amount of money, what is the impact on the number of visitors on my website?
Of course, there are many measures that can be taken to answer such a question. I decided to measure the correlation between the initiative and the metric, with the latter being shifted forward in time.

Unsupervised text classification with word embeddings
https://maxhalford.github.io/blog/unsupervised-text-classification/
Sat, 03 Oct 2020 00:00:00 +0000
Addendum: since writing this article, I have discovered that the method I describe is a form of zero-shot learning. So I guess you could say that this article is a tutorial on zero-shot learning for NLP.
I recently watched a lecture by Adam Tauman Kalai on stereotype bias in text data. The lecture is very good, but something that had nothing to do with the lecture’s main topic caught my attention.

Focal loss implementation for LightGBM
https://maxhalford.github.io/blog/lightgbm-focal-loss/
Sun, 20 Sep 2020 00:00:00 +0000
Edit – 2021-01-26
I initially wrote this blog post using version 2.3.1 of LightGBM. I’ve now updated it to use version 3.1.1. There are a couple of subtle but important differences between version 2.x.y and 3.x.y. If you’re using version 2.x.y, then I strongly recommend that you upgrade to version 3.x.y.
Motivation
If you’re reading this blog post, then you’re likely to be aware of LightGBM. The latter is a best-of-breed gradient boosting library.

A few intermediate pandas tricks
https://maxhalford.github.io/blog/pandas-tricks/
Mon, 17 Aug 2020 00:00:00 +0000
I want to use this post to share some pandas snippets that I find useful. I use them from time to time, in particular when I’m doing time series competitions on platforms such as Kaggle. Like any data scientist, I perform similar data processing steps on different datasets. Usually, I put repetitive patterns in xam, which is my personal data science toolbox. However, I think that the following snippets are too small and too specific to be added to a library.

The correct way to evaluate online machine learning models
https://maxhalford.github.io/blog/online-learning-evaluation/
Sun, 07 Jun 2020 00:00:00 +0000
Motivation
Most supervised machine learning algorithms work in the batch setting, whereby they are fitted on a training set offline, and are used to predict the outcomes of new samples. The only way for batch machine learning algorithms to learn from new samples is to train them from scratch with both the old samples and the new ones. Meanwhile, some learning algorithms are online, and can predict as well as update themselves when new samples are available.

Server-sent events in Flask without extra dependencies
https://maxhalford.github.io/blog/flask-sse-no-deps/
Mon, 04 May 2020 00:00:00 +0000
Server-sent events (SSE) is a mechanism for sending updates from a server to a client. The fundamental difference with WebSockets is that the communication only goes in one direction. In other words, the client cannot send information to the server.
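As a rough illustration of what the server side of SSE has to produce, here is a minimal sketch of the `text/event-stream` message format using only the standard library. The helper name `format_sse` is an assumption made for this example and is not taken from the post; a Flask route would stream strings like these with the `text/event-stream` content type.

```python
from typing import Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Serialise one message in the text/event-stream wire format.

    `format_sse` is a hypothetical helper name used for illustration only.
    """
    lines = []
    if event is not None:
        # Optional event name, which the client can listen for specifically.
        lines.append(f"event: {event}")
    # A payload containing newlines is sent as one "data:" field per line.
    lines.extend(f"data: {chunk}" for chunk in data.split("\n"))
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    print(repr(format_sse("hello")))
    print(repr(format_sse("a\nb", event="tick")))
```

Keeping the formatting in a small pure function makes it easy to check independently of the web framework that ends up streaming the messages.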
For many use cases this is all you might need. Indeed, if you just want to receive notifications/updates/messages, then using a WebSocket is overkill. Once you’ve implemented the SSE functionality on your server, then all you need on a JavaScript client is an EventSource.

I got plagiarized and Google didn't help
https://maxhalford.github.io/blog/plagiarism-google-didnt-help/
Fri, 17 Apr 2020 00:00:00 +0000
One of my most popular articles is the one on target encoding. It gets a fair amount of mentions on Kaggle discussions and I see it pop up from time to time in other contexts. It also brought me around 2500 unique monthly viewers. That’s quite a chunk of people for an unambitious blogger like me. Up to a few months ago, my article was on the first page of Google when you typed in searches such as “target encoding python” and “bayesian target encoding”.

Speeding up scikit-learn for single predictions
https://maxhalford.github.io/blog/speeding-up-sklearn-single-predictions/
Tue, 31 Mar 2020 00:00:00 +0000
It is now common practice to train machine learning models offline before putting them behind an API endpoint to serve predictions. Specifically, we want an API route which can make a prediction for a single row/instance/sample/data point/individual (call it what you want). Nowadays, we have great tools to do this that take care of the nitty-gritty details, such as Cortex, MLFlow, Kubeflow, and Clipper.
There are also paid services that hold your hand a bit more, such as DataRobot, H2O, and Cubonacci.

Bayesian linear regression for practitioners
https://maxhalford.github.io/blog/bayesian-linear-regression/
Wed, 26 Feb 2020 00:00:00 +0000
Motivation
Suppose you have an infinite stream of feature vectors $x_i$ and targets $y_i$. In this case, $i$ denotes the order in which the data arrives. If you’re doing supervised learning, then your goal is to estimate $y_i$ before it is revealed to you. In order to do so, you have a model which is composed of parameters denoted $\theta_i$. For instance, $\theta_i$ represents the feature weights when using linear regression.

Under-sampling a dataset with desired ratios
https://maxhalford.github.io/blog/undersampling-ratios/
Tue, 17 Dec 2019 00:00:00 +0000
Introduction
I’ve just spent a few hours looking at under-sampling and how it can help a classifier learn from an imbalanced dataset. The idea is quite simple: randomly sample the majority class and leave the minority class untouched. There are more sophisticated ways to do this – for instance by creating synthetic observations from the minority class à la SMOTE – but I won’t be discussing that here.
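The idea of random under-sampling can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `random_under_sample` is hypothetical, and down-sampling every class to the size of the smallest one (a 1:1 ratio) is a choice made for this example, not necessarily the ratio the post settles on.

```python
import random
from collections import defaultdict


def random_under_sample(labels, seed=None):
    """Return sorted row indices after shrinking every class to the minority size.

    Hypothetical helper for illustration; the 1:1 target ratio is an assumption.
    """
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    # Size of the smallest class.
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        # The minority class has exactly n_min rows, so all of them are kept;
        # larger classes are sampled without replacement.
        kept.extend(rng.sample(idx, n_min))
    return sorted(kept)


if __name__ == "__main__":
    y = [0] * 8 + [1] * 2
    print(random_under_sample(y, seed=0))
```

Returning indices rather than rows keeps the sketch independent of how the features are stored, so the same indices can be used to select from a list, an array, or a DataFrame.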
I checked out the imblearn library and noticed they have an implementation of random under-sampling aptly named RandomUnderSampler.

Finding fuzzy duplicates with pandas
https://maxhalford.github.io/blog/transitive-duplicates/
Mon, 16 Sep 2019 00:00:00 +0000
Duplicate detection is the task of finding two or more instances in a dataset that are in fact identical. As an example, take the following toy dataset:
   First name  Last name  Email
0  Erlich      Bachman    eb@piedpiper.com
1  Erlich      Bachmann   eb@piedpiper.com
2  Erlik       Bachman    eb@piedpiper.co
3  Erlich      Bachmann   eb@piedpiper.com

Each of these instances (rows, if you prefer) corresponds to the same “thing” – note that I’m not using the word “entity” because entity resolution is a different, yet related, concept.

A smooth approach to putting machine learning into production
https://maxhalford.github.io/blog/machine-learning-production/
Sat, 13 Jul 2019 00:00:00 +0000
Putting machine learning into production is hard. Usually I’m doubtful of such statements, but in this case I’ve never met anyone for whom everything has gone smoothly. Most data scientists might agree that there is a huge gap between their local environment and a live environment. In fact, “productionalizing” machine learning is such a complex topic that entire companies have risen to address the issue. I’m not just talking about running a gigantic grid search and finding the best model, I’m talking about putting a machine learning model live so that it actually has a positive impact on your business/project.

Skyline queries in Python
https://maxhalford.github.io/blog/skyline-queries/
Tue, 21 May 2019 00:00:00 +0000
Imagine that you’re looking to buy a home. If you have an analytical mind then you might want to tackle this with a quantitative approach. Let’s suppose that you have a list of potential homes, and each home has some attributes that can help you compare them. As an example, we’ll consider three attributes:
- The price of the house, which you want to minimize
- The size of the house, which you want to maximize
- The city where the house is located, which you don’t really care about

Some houses will be objectively better than others because they will be cheaper and bigger.

SQL subquery enumeration
https://maxhalford.github.io/blog/sql-subquery-enumeration/
Mon, 06 May 2019 00:00:00 +0000
I recently stumbled on a rather fun problem during my PhD. I wanted to generate all possible subqueries from a given SQL query. In this case an example is easily worth a thousand words. Take the following SQL query:
SELECT *
FROM customers AS c, purchases AS p, shops AS s
WHERE p.customer_id = c.id
  AND p.shop_id = s.id
  AND c.nationality = 'Swedish'
  AND c.hair = 'Blond'
  AND s.city = 'Stockholm'

Here are all the possible subqueries that can be generated from the above query.

Morellet crosses with JavaScript
https://maxhalford.github.io/blog/morellet/
Sun, 03 Feb 2019 00:00:00 +0000
These days I’m working on a deep learning project. I hate it but I promised myself to give it a real try. My scripts are taking a long time so I decided to do some procedural art while I waited. This time I’m going to reproduce the following crosses made by François Morellet. I saw them the last time I went to the Musée Pompidou with some friends from university. I don’t have a smartphone anymore so one of my friends was kind enough to take a few pictures for me, including this one.

Streaming groupbys in pandas for big datasets
https://maxhalford.github.io/blog/pandas-streaming-groupby/
Wed, 05 Dec 2018 00:00:00 +0000
If you’ve done a bit of Kaggling, then you’ve probably been typing a fair share of df.groupby(some_col). That is, if you’re using Python. If you’re handling tabular data, then a lot of your features will revolve around computing aggregate statistics. This is very true for the ongoing PLAsTiCC Astronomical Classification challenge. The goal of the competition is to classify objects in the sky into one of 14 groups. The bulk of the available data is a set of so-called light curves.

Target encoding done the right way
https://maxhalford.github.io/blog/target-encoding/
Sat, 13 Oct 2018 00:00:00 +0000
When you’re doing supervised learning, you often have to deal with categorical variables. That is, variables which don’t have a natural numerical representation.
The problem is that most machine learning algorithms require the input data to be numerical. At some point or another a data science pipeline will require converting categorical variables to numerical variables.
There are many ways to do so:
- Label encoding where you choose an arbitrary number for each category
- One-hot encoding where you create one binary column per category
- Vector representation a.

Stella triangles with JavaScript
https://maxhalford.github.io/blog/stella-triangles/
Thu, 26 Apr 2018 00:00:00 +0000
Around the same time last year I visited the San Francisco Museum of Modern Art. Frank Stella’s compositions really caught my eye. When I saw them I started thinking about how I could write a computer program to imitate his work. In this post I’m going to attempt to reproduce his so-called V Series.
Nice and simple, right? Indeed, in a lot of his work Frank Stella uses straight lines without much randomness.

Unknown pleasures with JavaScript
https://maxhalford.github.io/blog/unknown-pleasures/
Mon, 24 Jul 2017 00:00:00 +0000
No, this blog post is not about how nice JavaScript can be; instead it’s just another one of my attempts at reproducing modern art with procedural generation and the HTML5 <canvas> element. This time I randomly generated images resembling the cover of the album by Joy Division called “Unknown Pleasures”.
According to Wikipedia, this somewhat iconic album cover is based on radio waves. I saw a poster of it in a bar not long ago and decided to reproduce it the next time I had some time to kill.

Subsampling a training set to match a test set - Part 1
https://maxhalford.github.io/blog/subsampling-1/
Mon, 19 Jun 2017 00:00:00 +0000
Some friends and I recently qualified for the final of the 2017 edition of the Data Science Game competition. The first part was a Kaggle competition with data provided by Deezer. The problem was a binary classification task where one had to predict if a user was going to listen to a song that was proposed to them. Like many teams we extracted clever features and trained an XGBoost classifier, classic.

Halftoning with Go - Part 2
https://maxhalford.github.io/blog/halftoning-2/
Mon, 20 Mar 2017 00:00:00 +0000
The next stop on my travel through the world of halftoning will be the implementation of Weighted Voronoi Stippling as described in Adrian Secord’s 2002 paper. This method is more involved than the ones I detailed in my previous blog post; however, the results are quite interesting. Again, I did the implementation in Go.
Notice the black dot in the middle of the white square?
Overview
I found a fair amount of resources about the method, most of them being implementations of Adrian Secord’s paper.

Grid paintings à la Mondrian with JavaScript
https://maxhalford.github.io/blog/mondrian/
Sat, 04 Mar 2017 00:00:00 +0000
I was at a laundrette today and had just finished my book so I had some time to kill. Naturally I devised an algorithm for generating drawings that would resemble the grid-like paintings that Piet Mondrian made famous. With the benefit of hindsight I guess I could indulge in saner activities while waiting for my laundry to dry!
I went through different ideas but in the end I settled on a recursive approach.

A short introduction and conclusion to the OpenBikes 2016 Challenge
https://maxhalford.github.io/blog/openbikes-challenge/
Thu, 26 Jan 2017 00:00:00 +0000
During my undergraduate internship in 2015 I started a side project called OpenBikes. The idea was to visualize and analyze bike sharing over multiple cities. Axel Bellec joined me and in 2016 we won a national open data competition. Since then we haven’t pursued anything major; instead we use OpenBikes to try out technologies and to apply concepts we learn at university and online.
Before the 2016 summer holidays one of my professors, Aurélien Garivier, mentioned that he was considering using our data for a Kaggle-like competition between some statistics curriculums in France.

Halftoning with Go - Part 1
https://maxhalford.github.io/blog/halftoning-1/
Sun, 27 Nov 2016 00:00:00 +0000
Recently I stumbled upon this webpage which shows how to use a TSP solver as a halftoning technique. I began to read about related concepts like dithering and stippling. I don’t have any background in photography but I can appreciate the visual appeal of these techniques. As I understand it these techniques were first invented to save ink for printing. However, nowadays printing has become cheaper and the modern use of these techniques is mostly aesthetic, at least for images.

Recursive polygons with JavaScript
https://maxhalford.github.io/blog/recursive-polygons/
Fri, 25 Mar 2016 00:00:00 +0000
I like modern art. I enjoy looking at the stuff that was made at the beginning of the 20th century and thinking how it is still shaping today’s style. I’m not an expert, it’s just a hobby of mine. I especially like the Centre Pompidou in Paris, it’s got loads of fascinating stuff. While I was going through the galleries it struck me that some of the paintings were very geometrical.

The Naïve Bayes classifier
https://maxhalford.github.io/blog/naive-bayes/
Thu, 10 Sep 2015 00:00:00 +0000
The objective of a classifier is to decide to which class (also called label) to assign an observation based on observed data. In supervised learning, this is done by taking into account previous classifications. In other words, if we know that certain observations are classified in a certain way, the goal is to determine the class of a new observation.
The first group of observations on which the classifier is built is called the training set.

An introduction to genetic algorithms
https://maxhalford.github.io/blog/genetic-algorithms-introduction/
Sun, 02 Aug 2015 00:00:00 +0000
The goal of genetic algorithms (GAs) is to solve problems whose solutions are not easily found (i.e. NP problems, nonlinear optimization, etc.). For example, finding the shortest path from A to B in a directed graph is easily done with Dijkstra’s algorithm; it can be solved in polynomial time. However, the time to find the shortest path that joins all points on a non-directed graph, also known as the Travelling Salesman Problem (TSP), increases exponentially as the number of points increases.

Setting up a droplet to host a Flask app
https://maxhalford.github.io/blog/flask-droplet/
Tue, 14 Jul 2015 00:00:00 +0000
After having worked for some weeks on the OpenBikes website, it was time to put it online. Digital Ocean seemed to provide a good service and so I decided to give it a spin. Their documentation is quite good but it doesn’t cover exactly everything for setting up Flask. In this post I simply want to record every single step I took.
OpenBikes is a project with a Flask backend and a few upstart jobs.

Visualizing bike stations live data
https://maxhalford.github.io/blog/bike-stations/
Wed, 03 Jun 2015 00:00:00 +0000
Recently some friends and I decided to launch openbikes.co, a website for visualizing (and later on analyzing) urban bike traffic. We have a lot of ideas that we will progressively implement. Anyway, the point is that all of it started with me fiddling about with the JCDecaux API and the leaflet.js library and I would like to share it with you. Shall we?
Presentation
In this post I want to show you the tools and the code to get a fully functional website for visualizing live data.

An introduction to symbolic regression
https://maxhalford.github.io/slides/symbolic-regression/
Mon, 01 Jan 0001 00:00:00 +0000
An introduction to symbolic regression
Max Halford - PhD student IRIT/IMT
Toulouse Data Science Meetup - December 2017
.center[ .left-column[![tds_logo](/assets/img/presentations/tds_logo.jpeg)] .right-column[![xgp_logo](/assets/img/presentations/xgp_logo.png)] ]
---
layout: true
# Symbolic regression
---
## Quick overview
- The goal is to evolve "programs" with selection, mutation, and crossover
- Selection keeps programs that perform well
- Mutation changes a piece of the program
- Crossover combines two programs
---
---
## Example programs
---
## Kaggle Titanic top 1% 🚢
2 years ago [scirpus](https://www.

Bio
https://maxhalford.github.io/bio/
Mon, 01 Jan 0001 00:00:00 +0000
Hello 👋
I’m a full stack data scientist working at Alan. I’m half British 🇬🇧 and half Belgian 🇧🇪. I went to university in Toulouse, France 🇫🇷. My academic 🎓 background is a mix of maths 🧮, economics 💸, and computer science 🖥️. I got hooked into data science in 2014 after watching Moneyball ⚾ and reading The Signal and the Noise 📖. My PhD topic had to do with database query optimisation and machine learning 🤖.

Links
https://maxhalford.github.io/links/
Mon, 01 Jan 0001 00:00:00 +0000
Papers
- PhD - Statistical learning for selectivity estimation in relational databases (manuscript, slides)
- Selectivity correction with online machine learning - BDA 2020
- Selectivity Estimation with Attribute Value Dependencies using Linked Bayesian Networks - TLDKS 2020
- An Approach Based on Bayesian Networks for Query Selectivity Estimation - DASFAA, 2019
- Entropic Variable Projection for Explainability and Intepretability - 2018
- Master 2 year internship at HelloFresh (report, slides)
- Master 1 year internship at Privateaser (report, slides)
- Undergraduate internship at INSA Toulouse (report, slides)
- Detailed solutions to the first 30 Project Euler problems
Talks
- The challenges of online machine learning in production - Itaú Unibanco Meetup 2021
- Quelle est l’empreinte écologique du Big Data?