Max Halford

LCA software: exit the matrix

maxhalford25@gmail.com (Max Halford) — Sun, 09 Jun 2024 00:00:00 +0000

Measuring the environmental impact of a product is done using life cycle assessment (LCA). This is a methodology that breaks down a product’s life cycle into stages (LCI), and measures the impact of each stage on the environment (LCIA). There are a few pieces of LCA software to choose from. The leading ones are SimaPro, GaBi, openLCA, and Umberto. These are all proprietary software, and they’re expensive. But there’s a free and open source alternative: Brightway.

Cutting up shoes to measure their footprint

maxhalford25@gmail.com (Max Halford) — Fri, 17 May 2024 00:00:00 +0000

Our mission at Carbonfact is to measure the environmental impact of clothes. This involves a lot of steps. The main one is to determine what materials a product is made of, along with each material’s mass. This is straightforward for most clothes like jumpers and pants. These are typically made of a single fabric, such as cotton or polyester. The mass of each material is roughly the same as the product’s mass.

A training set for bike sharing forecasting

maxhalford25@gmail.com (Max Halford) — Thu, 04 Apr 2024 00:00:00 +0000

Last night I went to a Toulouse Data Science meetup. The talks were about generative AI and information retrieval, which aren’t topics I’m knowledgeable about. However, one of the speakers was a friend of mine, so I went to support him. Toulouse is my hometown, so I bumped into a few people I knew. It was a nice evening. I chatted with an old office mate from when I interned at INSA Toulouse.

Fast Poetry and pre-commit with GitHub Actions

maxhalford25@gmail.com (Max Halford) — Tue, 27 Feb 2024 00:00:00 +0000

This is a short post to share a GitHub Actions pattern I use to set up Poetry and pre-commit. These two tools cover most of my Python development needs. I use Poetry to manage dependencies and pre-commit to run code checks and formatting. The setup is fast because it caches the virtual environment and the .local directory. I like to use custom actions for this type of stuff. These are base actions that can be reused in multiple workflows.

Decomposing funnel metrics

maxhalford25@gmail.com (Max Halford) — Thu, 14 Dec 2023 00:00:00 +0000

Funnel metrics as products

I talked about metric decomposition in a previous article, and how it can be used to explain why metrics change values over time. That article explained how to decompose a sum, as well as a ratio. In this article, I’ll explain how to decompose a product.

    revenue = impressions * click_rate * conversion_rate * spend

The decomposition in this article isn’t limited to funnels. It can be applied to any metric that is expressed as a product of factors.
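As a rough illustration of what decomposing a product can look like, here is a minimal sketch based on the logarithmic mean (the approach known as LMDI in index decomposition analysis). This is an assumption on my part and not necessarily the method the article derives. The factor names come from the revenue formula in this excerpt, while the numbers, `log_mean` and `decompose_product` are made up for the example.

```python
import math


def log_mean(a: float, b: float) -> float:
    """Logarithmic mean of two positive numbers."""
    if math.isclose(a, b):
        return a
    return (a - b) / (math.log(a) - math.log(b))


def decompose_product(before: dict, after: dict) -> dict:
    """Attribute the change in a product of positive factors to each factor."""
    total_before = math.prod(before.values())
    total_after = math.prod(after.values())
    weight = log_mean(total_after, total_before)
    # Each factor receives a share proportional to its log-ratio.
    return {name: weight * math.log(after[name] / before[name]) for name in before}


# Hypothetical values for two periods.
before = {"impressions": 1000, "click_rate": 0.10, "conversion_rate": 0.05, "spend": 20.0}
after = {"impressions": 1200, "click_rate": 0.08, "conversion_rate": 0.05, "spend": 25.0}

contributions = decompose_product(before, after)
for name, value in contributions.items():
    print(f"{name}: {value:+.2f}")
```

The contributions add up to the total change in the product, and a factor that did not move between the two periods gets a contribution of zero.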

Efficient ELT refreshes

maxhalford25@gmail.com (Max Halford) — Fri, 01 Dec 2023 00:00:00 +0000

A tenet of the modern data stack is the use of ELT (Extract, Load, Transform) over ETL (Extract, Transform, Load). In a nutshell, this means that most of the data transformation is done in the data warehouse. This has become the de facto standard for modern data teams, and is epitomized by dbt and its ecosystem. It’s a great time to be a data engineer! We at Carbonfact fully embrace the ELT paradigm.

Online machine learning on the road @ IDE+A, TH Köln

maxhalford25@gmail.com (Max Halford) — Thu, 26 Oct 2023 00:00:00 +0000

Sh*t flows downhill, but not at Carbonfact

maxhalford25@gmail.com (Max Halford) — Mon, 16 Oct 2023 00:00:00 +0000

I’m writing this after watching the talk Joe Reis gave at Big Data LDN. It’s called Data Modeling is Dead! Long Live Data Modeling! It’s an easy-to-watch short talk that calls out a few modern issues in the data world. I’d like to bounce off one of Joe’s slides. I’m aligned with Joe that many issues stem from the lack of unison between engineering and data teams. A fundamental aspect of the Modern Data Stack is to replicate/copy production data into an analytics warehouse.

Answering "Why did the KPI change?" using decomposition

maxhalford25@gmail.com (Max Halford) — Wed, 09 Aug 2023 00:00:00 +0000

Edit – I published a notebook here that deals with the case where dimension values may (dis)appear from one period of time to the next. The notebook decomposes a ratio, but the logic is also valid for decomposing a sum. Edit 2 – I’ve stumbled on this article by Shao Zhifei which provides a good derivation of the ratio decomposition formula. I contacted Shao Zhifei on LinkedIn, and he told me they heavily use these formulas at Grab.

Measuring the carbon footprint of pizzas

maxhalford25@gmail.com (Max Halford) — Sun, 25 Jun 2023 00:00:00 +0000

Making environmentally friendly decisions can only be done with the right information. At Carbonfact, we’ve realized a big challenge is the lack of information about industrial processes. We tackle that slowly but surely by gathering data from various sources, and making it available to our customers. Regarding food, the French government has a great initiative called Agribalyse. It’s a free database of environmental footprints for various food products. It includes raw ingredients straight from the farm, as well as ready-to-eat dishes from the supermarket.

Graph components with DuckDB

maxhalford25@gmail.com (Max Halford) — Sat, 03 Jun 2023 00:00:00 +0000

Introduction Graph problems are quite common. However, it’s rare to have access to a database offering graph semantics. There are graph databases, such as Neo4j and GraphX, but it’s difficult to justify setting one of those up. One could simply use networkx in Python. But that only works if the graph fits in memory. From a practical angle, the fact is that people are querying data warehouses in SQL. There are many good reasons to write graph algorithms in SQL.

For analytics, don't use dynamic JSON keys

maxhalford25@gmail.com (Max Halford) — Thu, 11 May 2023 00:00:00 +0000

I love the JSON format. It’s the kind of love that grows on you with time. Like others, I’ve been using JSON everywhere for so many years, to the point where I just take it for granted. I suppose the main thing I like about JSON is its flexibility. You can structure your JSONs without too much care. There will always be a way to consume and manipulate them. But I have discovered a bit of an anti-pattern, which I believe is worth raising awareness about.

Metric correctness doesn't matter, consistency does

maxhalford25@gmail.com (Max Halford) — Fri, 28 Apr 2023 00:00:00 +0000

According to the United Nations, the 15th of November was the day we crossed 8 billion humans on the planet. How can they be so sure of that? Surely there has to be some margin of error, meaning it could have happened on the 14th or 16th. Then again, does it matter? I would argue almost all metrics we look at are incorrect. For instance, I work at a company whose goal is to measure the carbon footprint of clothing items.

Online gradient descent written in SQL

maxhalford25@gmail.com (Max Halford) — Tue, 07 Mar 2023 00:00:00 +0000

Edit – this post generated a few insightful comments on Hacker News. I’ve also put the code in a notebook for ease of use. Introduction Modern MLOps is complex because it involves too many components. You need a message bus, a stream processing engine, an API, a model store, a feature store, a monitoring service, etc. Sadly, containerisation software and the unbundling trend have encouraged an appetite for complexity. I believe MLOps shouldn’t be this complex.

Using SymPy in Python doctests

maxhalford25@gmail.com (Max Halford) — Wed, 15 Feb 2023 00:00:00 +0000

A program which compiles and runs without errors isn’t necessarily correct. I find this to be especially true for statistical software, both as a developer and as a user. Small but nasty bugs creep up on me every week. I keep sane in the membrane by writing many unit tests 🐛🔨 I make heavy use of doctests. These are unit tests which you write as Python docstrings. They’re really handy because they kill two birds with one stone: the unit tests you write for a function also act as documentation.
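To make the idea of a doctest concrete, here is a minimal sketch using only the standard library. The `mean` function is a hypothetical example and is not taken from the post. The examples in the docstring serve as documentation and as unit tests at the same time.

```python
import doctest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence.

    >>> mean([1, 2, 3, 4])
    2.5
    >>> mean([10])
    10.0

    """
    return sum(values) / len(values)


# Runs the examples found in the docstring and prints a report only on failure.
doctest.run_docstring_examples(mean, {"mean": mean}, name="mean")
```

In a real project the same checks are usually triggered with `python -m doctest my_module.py` or through a test runner, rather than by calling `doctest` by hand.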

Online active learning in 80 lines of Python

maxhalford25@gmail.com (Max Halford) — Sun, 22 Jan 2023 00:00:00 +0000

Active learning is a way to get humans to label data efficiently. A good active learning strategy minimizes the number of necessary labels, while maximizing a model’s performance. This usually works by focusing on samples where the model is unsure of its prediction. In a batch setting, the model is periodically retrained to learn from the freshly labeled samples. However, the training time is usually too prohibitive for this to happen each time a new label is provided.

Are Airbnb guests less energy efficient than their host?

maxhalford25@gmail.com (Max Halford) — Tue, 17 Jan 2023 00:00:00 +0000

TLDR I compared the energy consumption of Airbnb guests versus their host, in the same apartment, during 2022. It appears that guests do in fact consume more energy than hosts. The data I used is available to any Airbnb host. I also open-sourced all the code I wrote for this analysis. Introduction European energy prices have soared in 2022. It’s gone to the point where some Airbnb hosts have become reluctant to rent, believing their guests are too wasteful and cost too much.

The future of River

maxhalford25@gmail.com (Max Halford) — Tue, 13 Dec 2022 00:00:00 +0000

When I see tweets like this one, I’m both happy because people are aware of River, but also irked because it’s really difficult to make production-grade open source software. We just had a developer meeting a week ago. We planned what we will work on during the first half of 2023. I thought it would be worthwhile to give a high-level view of how we envision River’s future. If not to be comprehensive, at least to reassure potential users that River is alive and kicking 🤺

Parsing garment descriptions with GPT-3

maxhalford25@gmail.com (Max Halford) — Sun, 20 Nov 2022 00:00:00 +0000

The task You’ll have heard of GPT-3 if you haven’t been hiding under a rock. I’ve recently been impressed by Nat Friedman teaching GPT-3 to use a browser, and SeekWell generating SQL queries from free-text. I think the most exciting use cases are yet to come. But GPT-3 has a good chance of changing the way we approach mundane tasks at work. I wrote an article a couple of months ago about a boring task I have to do at work.

Dynamic on-screen TV keyboards

maxhalford25@gmail.com (Max Halford) — Sun, 25 Sep 2022 00:00:00 +0000

This article has some interactive keyboards, therefore I recommend reading it from your computer rather than your phone. On-screen TV keyboards I’ve recently been spending time at my brother’s place. We usually eat in front of the TV. I’ve thus found myself typing stuff on the Netflix/Amazon/Plex TV apps. The typing happens through a remote controller, which is slower than typing with one’s fingers. However, the TV apps usually suggest the correct show/movie after five or six keystrokes, so it’s not that bad.

NLP at Carbonfact: how would you do it?

maxhalford25@gmail.com (Max Halford) — Tue, 06 Sep 2022 00:00:00 +0000

The task I work at a company called Carbonfact. Our core value proposition is computing the carbon footprint of clothing items, expressed in carbon dioxide equivalent – $kgCO_2e$ for short. For instance, we started by measuring the footprint of shoes – no pun intended. We do these measurements with life cycle analysis (LCA) software we built ourselves. We use these analyses to fuel higher-level tasks for our clients, such as carbon accounting and sustainable procurement.

Matrix inverse mini-batch updates

maxhalford25@gmail.com (Max Halford) — Wed, 24 Aug 2022 00:00:00 +0000

The inverse covariance matrix, also called precision matrix, is useful in many places across the field of statistics. For instance, in machine learning, it is used for Bayesian regression and mixture modelling. What’s interesting is that any batch model which uses a precision matrix can be turned into an online model. That is, provided the precision matrix can be estimated in a streaming fashion. For instance, scikit-learn’s elliptic envelope method could have an online variant with a partial_fit method.

A rant against dbt ref

maxhalford25@gmail.com (Max Halford) — Tue, 28 Jun 2022 00:00:00 +0000

Disclaimer Let me be absolutely clear: I think dbt is a great tool. Although this post is a rant, the goal is to be constructive and suggest an improvement. dbt in a nutshell dbt is a workflow orchestrator for SQL. In other words, it’s a fancy Make for data analytics. What makes dbt special is that it is the first workflow orchestrator that is dedicated to the SQL language. It said out loud what many data teams were thinking: you can get a lot done with SQL.

First IRL meetup with the River developers

maxhalford25@gmail.com (Max Halford) — Thu, 09 Jun 2022 00:00:00 +0000

River is a Python library for doing online machine learning. It’s the result of a merger in early 2020 between creme and scikit-multiflow. Saulo Mastelini, Jacob Montiel, and I are the three core developers. But there are many more people who contribute here and there! This week Saulo Mastelini and I got to meet in person. This is worth mentioning because Saulo is originally from Brazil, whereas I’m based in Europe.

Online machine learning with River @ GAIA

maxhalford25@gmail.com (Max Halford) — Thu, 07 Apr 2022 00:00:00 +0000

Fuzzy regex matching in Python

maxhalford25@gmail.com (Max Halford) — Mon, 04 Apr 2022 00:00:00 +0000

Fuzzy string matching in a nutshell Say we’re looking for a pattern in a blob of text. If you know the text has no typos, then determining whether it contains a pattern is trivial. In Python you can use the in operator. You can also write a regex pattern with the re module from the standard library. But what if the text contains typos? For instance, this might be the case with user inputs on a website, or with OCR outputs.
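The two exact approaches mentioned above can be sketched in a few lines. The sentence and the pattern are made up for the example. The last part shows why a single typo is enough to motivate fuzzy matching.

```python
import re

# Hypothetical text, standing in for a user input or an OCR output.
text = "The patient was prescribed paracetamol 500mg twice a day."

# Plain substring test with the `in` operator.
has_drug = "paracetamol" in text

# Regex search with the standard library's `re` module.
match = re.search(r"(\d+)\s?mg", text)
dosage = int(match.group(1)) if match else None

# One wrong character defeats the exact substring test.
typo_text = text.replace("paracetamol", "paracetamal")
has_drug_with_typo = "paracetamol" in typo_text

print(has_drug, dosage, has_drug_with_typo)
```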

OCR spelling correction is hard

maxhalford25@gmail.com (Max Halford) — Sun, 06 Mar 2022 00:00:00 +0000

I recently saw SymSpell pop up on Hacker News. It claims to be a million times faster than Peter Norvig’s spelling corrector. I think it’s great that there’s a fast open source solution for spelling correction. But in my experience, the most challenging aspect of spelling correction is not necessarily speed. When I worked at Alan, I mostly wrote logic to extract structured information from medical documents. After some months working on the topic, I have to admit I hadn’t cracked the problem.

Comic book panel segmentation

maxhalford25@gmail.com (Max Halford) — Sat, 05 Mar 2022 00:00:00 +0000

Edit (2023-05-26) – I’ve learnt about the Kumiko project, which is exactly devoted to slicing comic book panels. There’s even a live tool. I discovered it thanks to being pinged on this issue. Motivation I’ve recently been reading some comic books I used to devour as a kid. Especially those from the golden era of francophone comics: Thorgal, Lanfeust, XIII, Tintin, Largo Winch, Blacksad, Aldebaran, etc. It’s not easy to get my hands on many of them.

Online machine learning in practice @ PyData PDX

maxhalford25@gmail.com (Max Halford) — Wed, 09 Feb 2022 00:00:00 +0000

The online machine learning predict/fit switcheroo

maxhalford25@gmail.com (Max Halford) — Thu, 06 Jan 2022 00:00:00 +0000

Why I’m writing this Fact: designing open source software is hard. It’s difficult to make design decisions which don’t make any compromises. I like to fall back on Dieter Rams’ 10 principles for good design. I feel like they apply rather well to software design. Especially when said software is open source, due to the many users and the plethora of use cases. I had to make a significant design decision for River.

Weighted sampling without replacement in pure Python

maxhalford25@gmail.com (Max Halford) — Fri, 24 Dec 2021 00:00:00 +0000

I’m working on a problem where I need to sample k items from a list without replacement. The sampling has to be weighted. In Python, numpy has a random.choice method which allows doing this:

    import numpy as np

    n = 10
    k = 3
    np.random.seed(42)
    population = np.arange(n)
    weights = np.random.dirichlet(np.ones_like(population))
    np.random.choice(population, size=k, replace=False, p=weights)

    array([0, 9, 8])

I’m always wary of using numpy without thinking because I know it incurs some overhead.
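For comparison, here is one way to do weighted sampling without replacement using only the standard library. It is a minimal sketch of the key-based scheme by Efraimidis and Spirakis, which may or may not be the approach the full post settles on. The function name and the toy weights are assumptions made for the example.

```python
import heapq
import random


def weighted_sample_without_replacement(population, weights, k, rng=random):
    """Draw k distinct items, favouring items with larger weights."""
    # Each item gets the key u ** (1 / w) with u uniform in [0, 1).
    # Keeping the k largest keys yields a weighted sample without replacement.
    keyed = (
        (rng.random() ** (1.0 / weight), item)
        for item, weight in zip(population, weights)
        if weight > 0
    )
    return [item for _, item in heapq.nlargest(k, keyed)]


rng = random.Random(42)
population = list(range(10))
weights = [rng.random() for _ in population]  # arbitrary positive weights
sample = weighted_sample_without_replacement(population, weights, k=3, rng=rng)
print(sample)
```

The weights do not need to sum to one here, because only their relative sizes matter.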

Online machine learning in practice @ Applied AI

maxhalford25@gmail.com (Max Halford) — Fri, 17 Dec 2021 00:00:00 +0000

Online machine learning in practice @ LVMH

maxhalford25@gmail.com (Max Halford) — Fri, 10 Dec 2021 00:00:00 +0000

Web scraping, upside down

maxhalford25@gmail.com (Max Halford) — Thu, 11 Nov 2021 00:00:00 +0000

Motivation Web scraping is the art of extracting information from web pages. A web page is essentially an amalgamation of HTML tags. Usually, we’re looking for a particular piece of information on a given web page. This may be done by fetching the HTML content of the page in question, and then running some HTML parsing logic. It’s quite straightforward. There are many tools in the wild to perform web scraping.

One year at Alan

maxhalford25@gmail.com (Max Halford) — Tue, 26 Oct 2021 00:00:00 +0000

Context Today marks the one-year anniversary of my start at Alan. It’s my first real job, and certainly the place where I grew the most professionally. I’m writing this post to summarise what I did and what I learnt at Alan. Alan is a special company. It has a unique culture that is starting to become famous in France. I won’t expand on the way things work at Alan, and will simply focus on the way I experienced it.

Manipulating ephemeral data with git

maxhalford25@gmail.com (Max Halford) — Thu, 07 Oct 2021 00:00:00 +0000

Dashboards and GROUPING SETS

maxhalford25@gmail.com (Max Halford) — Fri, 10 Sep 2021 00:00:00 +0000

Motivation At Alan, we do almost all our data analysis in SQL. Our data warehouse used to be PostgreSQL, and we have since switched to Snowflake for performance reasons. We load data into our warehouse with Airflow. This includes dumps of our production database, third-party data, and health data from other actors in the health ecosystem. This is raw data. We transform this into prepared data via an in-house tool that resembles dbt.

Homoglyphs: different characters that look identical

maxhalford25@gmail.com (Max Halford) — Thu, 19 Aug 2021 00:00:00 +0000

A wild homoglyph appears

For instance, can you tell if there’s a difference between H and Η? How about N and Ν? These characters may seem identical, but they are actually different. You can try this out for yourself in Python:

    >>> 'H' == 'Η'
    False
    >>> 'N' == 'Ν'
    False

Indeed, these all represent different Unicode characters:

    >>> ord('H'), ord('Η')
    (72, 919)
    >>> ord('N'), ord('Ν')
    (78, 925)

Η in fact represents the capital Eta letter, while Ν is a capital Nu.
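One way to tell such look-alikes apart is to ask Python for their official Unicode names. The sketch below uses the standard `unicodedata` module and writes the Greek letters as escape sequences so that the difference is visible in the source code.

```python
import unicodedata

# Latin letters paired with their Greek look-alikes (code points 919 and 925).
pairs = [("H", "\u0397"), ("N", "\u039d")]

names = {char: unicodedata.name(char) for pair in pairs for char in pair}

for latin, lookalike in pairs:
    print(f"{names[latin]}  vs  {names[lookalike]}")
```

The names confirm that the first character of each pair is a Latin capital letter, while the second is a Greek capital letter.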

Automated document processing at Alan

maxhalford25@gmail.com (Max Halford) — Thu, 10 Jun 2021 00:00:00 +0000

Text classification by data compression

maxhalford25@gmail.com (Max Halford) — Tue, 08 Jun 2021 00:00:00 +0000

Edit – I posted this on Hacker News and got some valuable feedback. Many brought up the fact that you should be able to reuse the internal state of the compressor instead of recompressing the training data each time a prediction is made. There are also some insightful references to data compression theory and its ties to statistical learning. Last night I felt like reading Artificial Intelligence: A Modern Approach. I stumbled on something fun in the natural language processing chapter.
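The general idea of classifying text with a compressor can be sketched as follows: append the text to each class’s training corpus, and pick the class whose compressed size grows the least. This is a minimal illustration with `zlib` and two tiny made-up corpora, not the code from the post. It deliberately uses the naive strategy of recompressing the training data for every prediction, which is the inefficiency mentioned in the edit note.

```python
import zlib


def compressed_size(text: str) -> int:
    """Number of bytes needed to store the zlib-compressed text."""
    return len(zlib.compress(text.encode("utf-8")))


def classify(text: str, corpora: dict) -> str:
    """Pick the label whose corpus grows the least when `text` is appended."""

    def extra_bytes(label: str) -> int:
        corpus = corpora[label]
        return compressed_size(corpus + " " + text) - compressed_size(corpus)

    return min(corpora, key=extra_bytes)


# Hypothetical training corpora, one string per class.
corpora = {
    "sports": "the team won the match after scoring two goals in the second half " * 5,
    "cooking": "stir the sauce and simmer the onions with garlic and olive oil " * 5,
}

label = classify("simmer the garlic in olive oil", corpora)
print(label)
```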

Reducing the memory footprint of a scikit-learn text classifier

maxhalford25@gmail.com (Max Halford) — Sun, 11 Apr 2021 00:00:00 +0000

Context This week at Alan I’ve been working on parsing French medical prescriptions. There are three types of prescriptions: lenses, glasses, and pharmaceutical prescriptions. Different information needs to be extracted depending on the prescription type. Therefore, the first step is to classify the prescription. The prescriptions we receive are pictures taken by users with their phone. We run each image through an OCR to obtain a text transcription of the image.

An overview of dataset time travel

maxhalford25@gmail.com (Max Halford) — Wed, 07 Apr 2021 00:00:00 +0000

TLDR You’re a data scientist. The engineers in your company overwrite data in the production database. You want to access overwritten data to train your models. How? I thought time travel only existed in the movies You’re probably right, except maybe for this guy. I want to discuss a concept that’s been on my mind for a while now. I like to call it “dataset time travel” because it has a nice ring to it.

The challenges of online machine learning in production @ Itaú Unibanco

maxhalford25@gmail.com (Max Halford) — Fri, 26 Feb 2021 00:00:00 +0000

What is the ecological footprint of Big Data? @ Toulouse Tech

maxhalford25@gmail.com (Max Halford) — Fri, 22 Jan 2021 00:00:00 +0000

Organising a Kaggle InClass competition with a fairness metric

maxhalford25@gmail.com (Max Halford) — Thu, 21 Jan 2021 00:00:00 +0000

Some context I co-organised a data science competition during the second half of 2020. This was in fact the 5th edition of the “Défi IA”, which is a recurring event that happens on a yearly basis. It is essentially a supervised machine learning competition for students from French-speaking universities and engineering schools. This year was the first time that Kaggle was used to host the competition. Before that we used a custom platform that I wrote during my student years.

Converting Amazon Textract tables to pandas DataFrames

maxhalford25@gmail.com (Max Halford) — Thu, 14 Jan 2021 00:00:00 +0000

I’m currently doing a lot of document processing at work. One of my tasks is to extract tables from PDF files. I evaluated Amazon Textract’s table extraction capability as part of this task. It’s very well documented, as is the rest of Textract. I was slightly disappointed by the examples, but nothing serious. I wanted to write this short blog post to share a piece of code I use to convert tables extracted through Amazon Textract to pandas.

What my PhD was about

maxhalford25@gmail.com (Max Halford) — Wed, 06 Jan 2021 00:00:00 +0000

I defended my PhD thesis on the 12th of October 2020, exactly 3 years and 11 days after having started it. The title of my PhD is Machine learning for query selectivity estimation in relational databases. I thought it would be worthwhile to summarise what I did. Not sure anyone will read this, but at least I’ll be able to remember what I did when I grow old and senile.

Computing cross-correlations in SQL

maxhalford25@gmail.com (Max Halford) — Tue, 17 Nov 2020 00:00:00 +0000

Introduction I’m currently working on a problem at work where I have to measure the impact of a growth initiative on a performance metric. Hypothetically, this might answer the following kind of question: I’ve spent X amount of money, what is the impact on the number of visitors on my website? Of course, there are many measures that can be taken to answer such a question. I decided to measure the correlation between the initiative and the metric, with the latter being shifted forward in time.

Unsupervised text classification with word embeddings

maxhalford25@gmail.com (Max Halford) — Sat, 03 Oct 2020 00:00:00 +0000

Edit – since writing this article, I have discovered that the method I describe is a form of zero-shot learning. So I guess you could say that this article is a tutorial on zero-shot learning for NLP. Edit – I stumbled on a paper entitled “Towards Unsupervised Text Classification Leveraging Experts and Word Embeddings” which proposes something very similar. The paper is rather well written, so you might want to check it out.

Focal loss implementation for LightGBM

maxhalford25@gmail.com (Max Halford) — Sun, 20 Sep 2020 00:00:00 +0000

Edit (2021-01-26) – I initially wrote this blog post using version 2.3.1 of LightGBM. I’ve now updated it to use version 3.1.1. There are a couple of subtle but important differences between version 2.x.y and 3.x.y. If you’re using version 2.x.y, then I strongly recommend that you upgrade to version 3.x.y. Motivation If you’re reading this blog post, then you’re likely to be aware of LightGBM. The latter is a best-of-breed gradient boosting library.

A few intermediate pandas tricks

maxhalford25@gmail.com (Max Halford) — Mon, 17 Aug 2020 00:00:00 +0000

I want to use this post to share some pandas snippets that I find useful. I use them from time to time, in particular when I’m doing time series competitions on platforms such as Kaggle. Like any data scientist, I perform similar data processing steps on different datasets. Usually, I put repetitive patterns in xam, which is my personal data science toolbox. However, I think that the following snippets are too small and too specific to be added to a library.

A brief introduction to online machine learning @ Hong Kong Machine Learning Meetup

maxhalford25@gmail.com (Max Halford) — Wed, 10 Jun 2020 00:00:00 +0000

The correct way to evaluate online machine learning models

maxhalford25@gmail.com (Max Halford) — Sun, 07 Jun 2020 00:00:00 +0000

Motivation Most supervised machine learning algorithms work in the batch setting, whereby they are fitted on a training set offline, and are used to predict the outcomes of new samples. The only way for batch machine learning algorithms to learn from new samples is to train them from scratch with both the old samples and the new ones. Meanwhile, some learning algorithms are online, and can predict as well as update themselves when new samples are available.
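To illustrate how an online model can predict and then update itself on every new sample, here is a minimal sketch of a common evaluation scheme, often called progressive validation: each sample is first used to test the model, and only then to train it. The `RunningMean` model and the toy stream are assumptions made for the example, and the full post may well argue for a more refined procedure.

```python
class RunningMean:
    """Toy online regressor that predicts the mean of the targets seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict(self, x):
        return self.mean

    def learn(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


def progressive_validation(model, stream):
    """Mean absolute error, where each sample is predicted before it is learned."""
    total_error = 0.0
    count = 0
    for x, y in stream:
        total_error += abs(y - model.predict(x))  # test first
        model.learn(x, y)  # then train
        count += 1
    return total_error / count


# Hypothetical stream of (features, target) pairs, in arrival order.
stream = [({"t": t}, float(y)) for t, y in enumerate([1, 3, 5, 7])]
mae = progressive_validation(RunningMean(), stream)
print(mae)
```

Because every prediction is made before the corresponding target is shown to the model, no separate hold-out set is needed.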

Online machine learning with decision trees @ Toulouse AOC workgroup

maxhalford25@gmail.com (Max Halford) — Thu, 07 May 2020 00:00:00 +0000

Server-sent events in Flask without extra dependencies

maxhalford25@gmail.com (Max Halford) — Mon, 04 May 2020 00:00:00 +0000

Server-sent events (SSE) is a mechanism for sending updates from a server to a client. The fundamental difference with WebSockets is that the communication only goes in one direction. In other words, the client cannot send information to the server. For many use cases this is all you might need. Indeed, if you just want to receive notifications/updates/messages, then using a WebSocket is overkill. Once you’ve implemented the SSE functionality on your server, then all you need on a JavaScript client is an EventSource.
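On the wire, a server-sent event is just a few lines of text terminated by a blank line, which is what an EventSource expects to receive. The helper below is a minimal sketch of that format. The name `format_sse` and the example payload are assumptions, and the function is independent of any web framework.

```python
import json
from typing import Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Serialise one message using the text/event-stream wire format."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of the payload gets its own "data:" field.
    lines.extend(f"data: {line}" for line in data.split("\n"))
    # A blank line marks the end of the message.
    return "\n".join(lines) + "\n\n"


message = format_sse(json.dumps({"progress": 0.5}), event="update")
print(message, end="")
```

A server would stream such strings in a response whose content type is `text/event-stream`.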

I got plagiarized and Google didn't help

maxhalford25@gmail.com (Max Halford) — Fri, 17 Apr 2020 00:00:00 +0000

One of my most popular articles is the one on target encoding. It gets a fair amount of mentions on Kaggle discussions and I see it pop up from time to time in other contexts. It also brings me around 2500 unique monthly viewers. That’s quite a chunk of people for an unambitious blogger like me. Up to a few months ago, my article was on the first page of Google when you typed in searches such as “target encoding python” and “bayesian target encoding”.

Our solution to the IDAO 2020 qualifiers

maxhalford25@gmail.com (Max Halford) — Sun, 12 Apr 2020 00:00:00 +0000

Speeding up scikit-learn for single predictions

maxhalford25@gmail.com (Max Halford) — Tue, 31 Mar 2020 00:00:00 +0000

It is now common practice to train machine learning models offline before putting them behind an API endpoint to serve predictions. Specifically, we want an API route which can make a prediction for a single row/instance/sample/data point/individual (call it what you want). Nowadays, we have great tools to do this that take care of the nitty-gritty details, such as Cortex, MLFlow, Kubeflow, and Clipper. There are also paid services that hold your hand a bit more, such as DataRobot, H2O, and Cubonacci.

Machine learning for streaming data with creme

maxhalford25@gmail.com (Max Halford) — Thu, 26 Mar 2020 00:00:00 +0000

Global explanation of machine learning with sensitivity analysis @ MASCOT-NUM

maxhalford25@gmail.com (Max Halford) — Tue, 10 Mar 2020 00:00:00 +0000

Bayesian linear regression for practitioners

maxhalford25@gmail.com (Max Halford) — Wed, 26 Feb 2020 00:00:00 +0000

Motivation Suppose you have an infinite stream of feature vectors $x_i$ and targets $y_i$. In this case, $i$ denotes the order in which the data arrives. If you’re doing supervised learning, then your goal is to estimate $y_i$ before it is revealed to you. In order to do so, you have a model which is composed of parameters denoted $\theta_i$. For instance, $\theta_i$ represents the feature weights when using linear regression.

Under-sampling a dataset with desired ratios

maxhalford25@gmail.com (Max Halford) — Tue, 17 Dec 2019 00:00:00 +0000

Introduction I’ve just spent a few hours looking at under-sampling and how it can help a classifier learn from an imbalanced dataset. The idea is quite simple: randomly sample the majority class and leave the minority class untouched. There are more sophisticated ways to do this – for instance by creating synthetic observations from the minority class à la SMOTE – but I won’t be discussing that here. I checked out the imblearn library and noticed they have an implementation of random under-sampling aptly named RandomUnderSampler.
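The basic idea, randomly sampling the majority class while leaving the minority class untouched, can be sketched in pure Python. This is not imblearn’s RandomUnderSampler and not the code from the post. The function name, the `ratio` parameter and the toy data are assumptions made for the example.

```python
import random
from collections import Counter


def under_sample(rows, labels, ratio=1.0, seed=None):
    """Keep every minority row and at most `ratio` majority rows per minority row."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority = max(counts, key=counts.get)
    n_minority = sum(n for label, n in counts.items() if label != majority)
    n_keep = min(counts[majority], int(ratio * n_minority))

    # Randomly choose which majority rows survive.
    majority_idx = [i for i, label in enumerate(labels) if label == majority]
    kept = set(rng.sample(majority_idx, n_keep))

    keep = [i for i, label in enumerate(labels) if label != majority or i in kept]
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Hypothetical imbalanced dataset: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
rows = list(range(100))

sampled_rows, sampled_labels = under_sample(rows, labels, ratio=2.0, seed=0)
print(Counter(sampled_labels))
```

With `ratio=2.0`, the sampled dataset keeps the 10 positives and 20 randomly chosen negatives.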

The benefits of online machine learning @ Quantmetry

maxhalford25@gmail.com (Max Halford) — Tue, 29 Oct 2019 00:00:00 +0000

The benefits of online machine learning @ Element AI

maxhalford25@gmail.com (Max Halford) — Wed, 23 Oct 2019 00:00:00 +0000

Finding fuzzy duplicates with pandas

maxhalford25@gmail.com (Max Halford) — Mon, 16 Sep 2019 00:00:00 +0000

Duplicate detection is the task of finding two or more instances in a dataset that are in fact identical. As an example, take the following toy dataset:

       First name  Last name  Email
    0  Erlich      Bachman    eb@piedpiper.com
    1  Erlich      Bachmann   eb@piedpiper.com
    2  Erlik       Bachman    eb@piedpiper.co
    3  Erlich      Bachmann   eb@piedpiper.com

Each of these instances (rows, if you prefer) corresponds to the same “thing” – note that I’m not using the word “entity” because entity resolution is a different, and yet related, concept.
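A simple way to flag near-identical rows like the ones in this toy dataset is to compare them pairwise with a string similarity measure. The sketch below uses `difflib.SequenceMatcher` from the standard library instead of pandas, and the 0.9 threshold is an arbitrary assumption. It is meant as an illustration, not as the method described in the post.

```python
from difflib import SequenceMatcher
from itertools import combinations

# The toy dataset from the excerpt: first name, last name, email.
records = [
    ("Erlich", "Bachman", "eb@piedpiper.com"),
    ("Erlich", "Bachmann", "eb@piedpiper.com"),
    ("Erlik", "Bachman", "eb@piedpiper.co"),
    ("Erlich", "Bachmann", "eb@piedpiper.com"),
]


def similarity(a, b):
    """Similarity in [0, 1] between two records, compared as joined strings."""
    return SequenceMatcher(None, " ".join(a), " ".join(b)).ratio()


THRESHOLD = 0.9  # arbitrary cut-off, to be tuned on real data

pairs = [
    (i, j)
    for (i, a), (j, b) in combinations(enumerate(records), 2)
    if similarity(a, b) >= THRESHOLD
]
print(pairs)
```

Comparing every pair is quadratic in the number of rows, so this only scales to small datasets without some form of blocking.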

A smooth approach to putting machine learning into production

maxhalford25@gmail.com (Max Halford) — Sat, 13 Jul 2019 00:00:00 +0000

Putting machine learning into production is hard. Usually I’m doubtful of such statements, but in this case I’ve never met anyone for whom everything has gone smoothly. Most data scientists might agree that there is a huge gap between their local environment and a live environment. In fact, “productionalizing” machine learning is such a complex topic that entire companies have risen to address the issue. I’m not just talking about running a gigantic grid search and finding the best model, I’m talking about putting a machine learning model live so that it actually has a positive impact on your business/project.

The benefits of online machine learning @ Airbus Bizlab

maxhalford25@gmail.com (Max Halford) — Fri, 28 Jun 2019 00:00:00 +0000

Incremental machine learning: from concepts to practice @ Toulouse Data Science Meetup

maxhalford25@gmail.com (Max Halford) — Tue, 28 May 2019 00:00:00 +0000

Skyline queries in Python

maxhalford25@gmail.com (Max Halford) — Tue, 21 May 2019 00:00:00 +0000

Imagine that you’re looking to buy a home. If you have an analytical mind then you might want to tackle this with a quantitative approach. Let’s suppose that you have a list of potential homes, and each home has some attributes that can help you compare them. As an example, we’ll consider three attributes:

- The price of the house, which you want to minimize
- The size of the house, which you want to maximize
- The city where the house is located, which you don’t really care about

Some houses will be objectively better than others because they will be cheaper and bigger.
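The notion of one house being objectively better than another can be written down as a dominance test, and the skyline is then the set of houses that no other house dominates. Here is a minimal quadratic-time sketch. The houses and their values are made up for the example.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on price and size, and better on at least one."""
    return (
        a["price"] <= b["price"]
        and a["size"] >= b["size"]
        and (a["price"] < b["price"] or a["size"] > b["size"])
    )


def skyline(houses):
    """Keep the houses that are not dominated by any other house."""
    return [h for h in houses if not any(dominates(other, h) for other in houses)]


# Hypothetical listings: price to minimize, size to maximize, city ignored.
houses = [
    {"city": "Toulouse", "price": 200_000, "size": 80},
    {"city": "Bordeaux", "price": 250_000, "size": 70},  # pricier and smaller than Toulouse
    {"city": "Lyon", "price": 300_000, "size": 120},
    {"city": "Nantes", "price": 150_000, "size": 50},
]

best = skyline(houses)
print([h["city"] for h in best])
```

The Bordeaux house drops out because the Toulouse one is both cheaper and bigger. The other three each win on at least one criterion, so they all stay in the skyline.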

Online machine learning with creme @ PyData Amsterdam

maxhalford25@gmail.com (Max Halford) — Sat, 11 May 2019 00:00:00 +0000

SQL subquery enumeration

maxhalford25@gmail.com (Max Halford) — Mon, 06 May 2019 00:00:00 +0000

I recently stumbled on a rather fun problem during my PhD. I wanted to generate all possible subqueries from a given SQL query. In this case an example is easily worth a thousand words. Take the following SQL query:

    SELECT *
    FROM customers AS c, purchases AS p, shops AS s
    WHERE p.customer_id = c.id
    AND p.shop_id = s.id
    AND c.nationality = 'Swedish'
    AND c.hair = 'Blond'
    AND s.city = 'Stockholm'

Here are all the possible subqueries that can be generated from the above query.
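The excerpt stops before listing the subqueries, so the following is only one plausible reading: treat a subquery as any subset of the tables that stays connected through the join conditions. Under that assumption, a brute-force enumeration might look like this. The `tables` and `joins` structures and the function names are hypothetical.

```python
from itertools import combinations

# Table aliases and join conditions taken from the example query.
tables = ["c", "p", "s"]
joins = [("p", "c"), ("p", "s")]  # p.customer_id = c.id, p.shop_id = s.id


def is_connected(subset, joins):
    """Check that the tables in subset form a connected join graph."""
    subset = set(subset)
    start = next(iter(subset))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for a, b in joins:
            # Follow a join edge in either direction, staying inside the subset.
            for src, dst in ((a, b), (b, a)):
                if src == node and dst in subset and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
    return seen == subset


def enumerate_subqueries(tables, joins):
    """Yield every connected subset of tables, smallest first."""
    for size in range(1, len(tables) + 1):
        for subset in combinations(tables, size):
            if is_connected(subset, joins):
                yield subset


subqueries = list(enumerate_subqueries(tables, joins))
```

With this definition `("c", "s")` is excluded, because customers and shops are only linked through purchases. The filter predicates on each table would simply follow whichever tables are kept.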

An approach based on Bayesian networks for query selectivity estimation @ DASFAA

maxhalford25@gmail.com (Max Halford) — Mon, 22 Apr 2019 00:00:00 +0000

Morellet crosses with JavaScript

maxhalford25@gmail.com (Max Halford) — Sun, 03 Feb 2019 00:00:00 +0000

These days I’m working on a deep learning project. I hate it but I promised myself to give it a real try. My scripts are taking a long time so I decided to do some procedural art while I waited. This time I’m going to reproduce the following crosses made by François Morellet. I saw them the last time I went to the Musée Pompidou with some friends from university. I don’t have a smartphone anymore so one of my friends was kind enough to take a few pictures for me, including this one.

Streaming groupbys in pandas for big datasets

maxhalford25@gmail.com (Max Halford) — Wed, 05 Dec 2018 00:00:00 +0000

If you’ve done a bit of Kaggling, then you’ve probably been typing a fair share of df.groupby(some_col). That is, if you’re using Python. If you’re handling tabular data, then a lot of your features will revolve around computing aggregate statistics. This is very true for the ongoing PLAsTiCC Astronomical Classification challenge. The goal of the competition is to classify objects in the sky into one of 14 groups. The bulk of the available data is a set of so-called light curves.

Target encoding done the right way

maxhalford25@gmail.com (Max Halford) — Sat, 13 Oct 2018 00:00:00 +0000

When you’re doing supervised learning, you often have to deal with categorical variables. That is, variables which don’t have a natural numerical representation. The problem is that most machine learning algorithms require the input data to be numerical. At some point or another a data science pipeline will require converting categorical variables to numerical variables. There are many ways to do so:

- Label encoding where you choose an arbitrary number for each category
- One-hot encoding where you create one binary column per category
- Vector representation a.
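As a quick illustration of the first two options, here is a small standard-library sketch. The `colors` column is made-up data, and sorting the categories is just one arbitrary way of assigning the numbers.

```python
# A hypothetical categorical column.
colors = ["red", "green", "blue", "green"]

# Fix an order for the categories so the encodings are reproducible.
categories = sorted(set(colors))  # ['blue', 'green', 'red']

# Label encoding: one arbitrary integer per category.
label_of = {category: i for i, category in enumerate(categories)}
labels = [label_of[c] for c in colors]

# One-hot encoding: one binary column per category.
one_hot = [[int(c == category) for category in categories] for c in colors]
```

Label encoding keeps a single column but imposes an artificial ordering on the categories. One-hot encoding avoids that ordering at the cost of one column per category.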

Stella triangles with JavaScript

maxhalford25@gmail.com (Max Halford) — Thu, 26 Apr 2018 00:00:00 +0000

Around the same time last year I visited the San Francisco Museum of Modern Art. Frank Stella’s compositions really caught my eye. When I saw them I started thinking about how I could write a computer program to imitate his work. In this post I’m going to attempt to reproduce his so-called V Series. Nice and simple right? Indeed in a lot of his work Frank Stella uses straight lines without much randomness.

Unknown pleasures with JavaScript

maxhalford25@gmail.com (Max Halford) — Mon, 24 Jul 2017 00:00:00 +0000

No, this blog post is not about how nice JavaScript can be. Instead it’s just another one of my attempts at reproducing modern art with procedural generation and the HTML5 <canvas> element. This time I randomly generated images resembling the cover of the album by Joy Division called “Unknown Pleasures”. According to Wikipedia, this somewhat iconic album cover is based on radio waves. I saw a poster of it in a bar not long ago and decided to reproduce it the next time I had some time to kill.

Subsampling a training set to match a test set - Part 1

maxhalford25@gmail.com (Max Halford) — Mon, 19 Jun 2017 00:00:00 +0000

Edit – it’s 2022 and I still haven’t written a part 2. That’s because I believe this problem is easily solved with adversarial validation. Some friends and I recently qualified for the final of the 2017 edition of the Data Science Game competition. The first part was a Kaggle competition with data provided by Deezer. The problem was a binary classification task where one had to predict if a user was going to listen to a song that was proposed to him.

Docker for data science @ HelloFresh Berlin

maxhalford25@gmail.com (Max Halford) — Thu, 01 Jun 2017 00:00:00 +0000

Halftoning with Go - Part 2

maxhalford25@gmail.com (Max Halford) — Mon, 20 Mar 2017 00:00:00 +0000

The next stop on my travel through the world of halftoning will be the implementation of Weighted Voronoi Stippling as described in Adrian Secord’s 2002 paper. This method is more involved than the ones I detailed in my previous blog post, however the results are quite interesting. Again, I did the implementation in Go. Notice the black dot in the middle of the white square?

Overview

I found a fair amount of resources about the method, most of them being implementations of Adrian Secord’s paper.

Grid paintings à la Mondrian with JavaScript

maxhalford25@gmail.com (Max Halford) — Sat, 04 Mar 2017 00:00:00 +0000

I was at a laundrette today and had just finished my book so I had some time to kill. Naturally I devised an algorithm for generating drawings that would resemble the grid-like paintings that Piet Mondrian made famous. With the benefit of hindsight I guess I could indulge in saner activities while waiting for my laundry to dry! I went through different ideas but in the end I settled on a recursive approach.

A short introduction and conclusion to the OpenBikes 2016 Challenge

maxhalford25@gmail.com (Max Halford) — Thu, 26 Jan 2017 00:00:00 +0000

During my undergraduate internship in 2015 I started a side project called OpenBikes. The idea was to visualize and analyze bike sharing over multiple cities. Axel Bellec joined me and in 2016 we won a national open data competition. Since then we haven’t pursued anything major. Instead we use OpenBikes to try out technologies and to apply concepts we learn at university and online. Before the 2016 summer holidays one of my professors, Aurélien Garivier, mentioned that he was considering using our data for a Kaggle-like competition between some statistics curriculums in France.

Challenge Big Data @ TSE

maxhalford25@gmail.com (Max Halford) — Mon, 09 Jan 2017 00:00:00 +0000

Halftoning with Go - Part 1

maxhalford25@gmail.com (Max Halford) — Sun, 27 Nov 2016 00:00:00 +0000

Recently I stumbled upon this webpage which shows how to use a TSP solver as a halftoning technique. I began to read about related concepts like dithering and stippling. I don’t have any background in photography but I can appreciate the visual appeal of these techniques. As I understand it these techniques were first invented to save ink for printing. However nowadays printing has become cheaper and the modern use of these techniques is mostly aesthetic, at least for images.

Predicting the availability of Velib' @ Toulouse Data Science Meetup

maxhalford25@gmail.com (Max Halford) — Wed, 30 Mar 2016 00:00:00 +0000

Recursive polygons with JavaScript

maxhalford25@gmail.com (Max Halford) — Fri, 25 Mar 2016 00:00:00 +0000

I like modern art, I enjoy looking at the stuff that was made at the beginning of the 20th century and thinking how it is still shaping today’s style. I’m not an expert, it’s just a hobby of mine. I especially like the Centre Pompidou in Paris, it’s got loads of fascinating stuff. While I was going through the galleries it struck me that some of the paintings were very geometrical.

The Naïve Bayes classifier

maxhalford25@gmail.com (Max Halford) — Thu, 10 Sep 2015 00:00:00 +0000

The objective of a classifier is to decide to which class (also called label) to assign an observation based on observed data. In supervised learning, this is done by taking into account previous classifications. In other words if we know that certain observations are classified in a certain way, the goal is to determine the class of a new observation. The first group of observations on which the classifier is built is called the training set.

An introduction to genetic algorithms

maxhalford25@gmail.com (Max Halford) — Sun, 02 Aug 2015 00:00:00 +0000

The goal of genetic algorithms (GAs) is to solve problems whose solutions are not easily found (e.g. NP problems, nonlinear optimization, etc.). For example, finding the shortest path from A to B in a directed graph is easily done with Dijkstra’s algorithm, which runs in polynomial time. However, the time to find the shortest path that joins all points on an undirected graph, also known as the Travelling Salesman Problem (TSP), increases exponentially as the number of points increases.
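The contrast between the two problems can be sketched in a few lines. The first function is a standard heap-based Dijkstra. The second is a brute-force TSP solver, not a genetic algorithm, that tries every ordering and therefore becomes impractical beyond a handful of points. It assumes the tour returns to its starting point. The graph and distance matrix are made-up data.

```python
import heapq
from itertools import permutations


def dijkstra(graph, source):
    """Shortest distances from source in a graph with non-negative weights."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def brute_force_tsp(matrix):
    """Try all (n - 1)! closed tours starting at point 0 and keep the shortest."""
    n = len(matrix)
    best_length, best_tour = float("inf"), None
    for perm in permutations(range(1, n)):
        tour = (0, *perm, 0)
        length = sum(matrix[a][b] for a, b in zip(tour, tour[1:]))
        if length < best_length:
            best_length, best_tour = length, tour
    return best_length, best_tour


# Hypothetical directed graph for Dijkstra.
graph = {"A": {"B": 1, "C": 4}, "B": {"C": 2, "D": 5}, "C": {"D": 1}, "D": {}}
distances = dijkstra(graph, "A")

# Hypothetical symmetric distance matrix for the TSP.
matrix = [
    [0, 1, 4, 3],
    [1, 0, 2, 5],
    [4, 2, 0, 1],
    [3, 5, 1, 0],
]
tsp_length, tsp_tour = brute_force_tsp(matrix)
```

With 4 points there are only 6 tours to check, but with 15 points there are already 14! (about 87 billion). That growth is what motivates heuristics such as genetic algorithms.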

Setting up a droplet to host a Flask app

maxhalford25@gmail.com (Max Halford) — Tue, 14 Jul 2015 00:00:00 +0000

After having worked for some weeks on the OpenBikes website, it was time to put it online. Digital Ocean seemed to provide a good service and so I decided to give it a spin. Their documentation is quite good but it doesn’t cover exactly everything for setting up Flask. In this post I simply want to record every single step I took. OpenBikes is a project with a Flask backend and a few upstart jobs.

Visualizing bike stations live data

maxhalford25@gmail.com (Max Halford) — Wed, 03 Jun 2015 00:00:00 +0000

Recently some friends and I decided to launch openbikes.co, a website for visualizing (and later on analyzing) urban bike traffic. We have a lot of ideas that we will progressively implement. Anyway, the point is that all of it started with me fiddling about with the JCDecaux API and the leaflet.js library and I would like to share it with you. Shall we?

Presentation

In this post I want to show you the tools and the code to get a fully functional website for visualizing live data.