Links

Smart people

Tim Salimans on Data Analysis
Randal Olson
Sam & Max – French and NSFW!
Sebastian Raschka
Clean Coder
Pythonic Perambulations
Erik Bernhardsson
otoro
Terra Incognita
Real Python
Airbnb Engineering
No Free Hunch
The Unofficial Google Data Science Blog
will wolf
Edwin Chen
Use the index, Luke!
Jack Preston
Agustinus Kristiadi
DataGenetics
Katherine Bailey
Netflix Research
inFERENce
Hyndsight – Rob Hyndman is a time series specialist.
While My MCMC Gently Samples
Ines Montani – by one of the founders of spaCy.
Stephen Smerity
Peter Norvig
IT Best Kept Secret Is Optimization – By Jean-Francois Puget, aka CPMP.
explained.ai
Better Explained
Genetic Argonaut
pandas blog
Towards Data Science
Probably Overthinking It
Simply Statistics
Practically Predictable
koaning – by Vincent Warmerdam, who made calmcode
blogarithms
Possibly Wrong
FastML
Parameter-free Learning and Optimization Algorithms
Todd W. Schneider – This guy is really good at exploratory data analysis.
Yann Thaddée – Not directly related to data science but interesting nonetheless.
Colins Blog
Fabien Sanglard – nothing to do with data science, but such good taste!
The Glowing Python – By the creator of MiniSom, which is worth checking out too.
Matt Hancock
Francis Bach – Someone with an h-index of 80+ who takes the time to blog is worth reading.
Gwern Branwen – Cool in a weird way.
Libres pensées d’un mathématicien ordinaire
Count Bayesie
Jim Savage
Nick Higham – A lot of well-explained algebra.
Calmcode – Not a blog per se, but a nice collection of short, to-the-point tutorials about various tools.
Chris Said
Evan Miller
Eric Jang
Andrey Akinshin
Single Lunch
Freakonometrics
Martin Daniel
Chris Kiehl
ithaka.im – A guy I met who travelled for 6 years with his wife on a bike, very inspiring.
Muthukrishnan – Has written some neat document processing stuff.
Björn Ottosson
Guilherme Duarte Marmerola
Cal Paterson
Claire Carroll
Luke Metz – Luke is working on the niche topic of meta-learning at Google. He also happens to be a very kind person.
Practical Recommendations – A blog about recommender systems.
Robin Linacre – Some good stuff related to record linkage.
Neal Lathia – Machine learning in production stuff.
John D. Cook
Brandon Roberts
Allen Downey
Christophe Blefari
Scott Rome
Eugene Yan
Lj Miranda
death and gravity – Great advanced Python resource.
The Shape of Data
IDEA
Shaded relief
Leslie Lamport
Curtis Miller
Naftali Harris
Laird Breyer – wrote some cool software for text classification called dbacl, and markovpr which is a PageRank implementation.
Vicky Boykis – the OG behind Normconf
Danielle Navarro
Amit Patel – visual explanations of algorithms used in games.
nogilnick
Nat Bullard – known for making annual presentations on the state of decarbonization.
Max Woolf
Yuin Chien – does design at Google.
Matt Webb – this madlad has been blogging since 2000.
Caleb Gross
Geoffrey Litt
Clayton Ramsey
Rusty Conover
Austin Z. Henley

Machine learning

The Elements of Statistical Learning - Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie
Machine Learning - Tom Mitchell – I think this wonderful textbook is under-appreciated.
Artificial Intelligence: A Modern Approach - Russell & Norvig
mlcourse.ai – Of all the introductions to machine learning, I think this is the one that strikes the best balance between theory and practice.
Machine learning cheat sheets - Shervine Amidi
Kalman and Bayesian Filters in Python - Roger Labbe – Kalman filters are notoriously hard to grok; this tutorial nicely builds up the steps to understanding them.
CS231n Convolutional Neural Networks for Visual Recognition - Stanford
Algorithmes d’optimisation non-linéaire sans contrainte (French) - Michel Bergmann
Graphical Models in a Nutshell - Koller et al.
Rules of Machine Learning: Best Practices for ML Engineering - Martin Zinkevich – You should read this once a year.
A Few Useful Things to Know about Machine Learning - Pedro Domingos – This short paper summarizes basic truths in machine learning.
How to Write a Spelling Corrector - Peter Norvig – Magic in 36 lines of code.
MCMC sampling for dummies - Thomas Wiecki
Your Easy Guide to Latent Dirichlet Allocation
An Intuitive Explanation of Convolutional Neural Networks - Ujjwal Karn
An overview of gradient descent optimization algorithms - Sebastian Ruder
How to explain gradient boosting - Terence Parr and Jeremy Howard – A very good introduction to vanilla gradient boosting with step-by-step examples.
Why Does XGBoost Win “Every” Machine Learning Competition? - Didrik Nielsen – This Master’s thesis goes into some of the details of XGBoost without being too bloated.
Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations
The Cramér-Rao Lower Bound on Variance: Adam and Eve’s “Uncertainty Principle” - Michael Powers
A Concrete Introduction to Probability (using Python) - Peter Norvig – Extremely elegant Python coding.
The Hungarian Maximum Likelihood Trick - Louis Abraham
Machine Learning for Signal Processing - University of Illinois
Gaussian Process, not quite for dummies - Yuge Shi – Gaussian processes are quite difficult to understand (at least, for me) but Yuge gives some great visual intuitions.
Frequentism and Bayesianism: A Python-driven Primer - Jake VanderPlas
Variational Inference: A Review for Statisticians - David Blei and his flock
The Performance of Decision Tree Evaluation Strategies - Andrew Tulloch
Simplifying Graph Convolutional Networks - Felix Wu et al. – A nice example of putting the horse before the cart.
MIT 6.867 machine learning course notes - Tommi Jaakkola – For people who enjoy concise mathematical notation.
A Recipe for Training Neural Networks - Andrej Karpathy
The Bitter Lesson - Richard Sutton
Introduction to Locality-Sensitive Hashing - Tyler Neylon
Transformers from scratch - Peter Bloem
A Machine Learning Primer - Mihail Eric – A good read for beginners in machine learning algorithms.
Fitting Bayesian structural time series with the bsts R package - Steven L. Scott
Super Fast String Matching in Python - Chris van den Berg
Poisson regression and non-normal loss - scikit-learn
Perfect lung cancer detections in a $1 million ML competition with an ingenious hack - Yusaku Sako
Beware Default Random Forest Importances
From RankNet to LambdaRank to LambdaMART: An Overview
Word2Vec Tutorial - The Skip-Gram Model - Chris McCormick
Causal Inference for The Brave and True
Data Distribution Shifts and Monitoring - Chip Huyen
Creating Confidence Intervals for Machine Learning Classifiers - Sebastian Raschka
Machine Learning @ VU
SVD Image Compression, Explained - Dennis Miczek
Classifying all of the pdfs on the internet - Santiago Pedroza
Weird Kaggle, the superiority of books, and other reflections - Nick Griffiths
Explaining RNNs without neural networks
Practical Deep Learning for Coders
What Is ChatGPT Doing… and Why Does It Work?
CARTE: toward table foundation models - Gaël Varoquaux
Everything you always wanted to know about extreme classification - Microsoft Research – I love the idea that recsys can be framed as an extreme classification problem.

Data science

Make for data scientists - Paul Butler
Tidy Data - Hadley Wickham – You need to be aware of this framework if you want to be serious about analysing tabular data.
Modeling marketing attribution - Claire Carroll – I worked on this problem for a short time at Alan. I definitely would have done a better job if I had read this article first.
Darts, Dice, and Coins: Sampling from a Discrete Distribution - Keith Schwarz
Unprojecting text with ellipses - Matt Zucker – See also this article on page dewarping by the same author.
Language models, classification and dbacl - Laird A. Breyer – Machine learning on text with a UNIX philosophy.
Teaching An Old Dog A New Trick - Chris Kamphuis
Optimal Peanut Butter and Banana Sandwiches - Ethan Rosenthal
The Data Science Hierarchy of Needs - Monica Rogati
Tuesday Changes Everything - Jesper Juul
Doing Named Entity Recognition? Don’t optimize for F1 - Christopher Manning – A rather niche topic, but well explained.
Lessons learned building an ML trading system that turned $5k into $200k
Common statistical tests are linear models (or: how to teach stats) - Jonas Kristoffer Lindeløv
Kelly Can’t Fail - John Mount

Analytics engineering

Mathematics

Physics

Data engineering

Emerging Architectures for Modern Data Infrastructure
What your data team is using: the analytics stack - Technically – Another solid article to understand what an analytics stack looks like in 2021.
Multiworld Testing Decision Service: A System for Experimentation, Learning, And Decision-Making
Machine Learning: The High-Interest Credit Card of Technical Debt - Google
Continuous Delivery for Machine Learning - Martin Fowler
Hidden Technical Debt in Machine Learning Systems - Google
The Log: What every software engineer should know about real-time data’s unifying abstraction - Jay Kreps
Command-line Tools can be 235x Faster than your Hadoop Cluster - Adam Drake
Git scraping, the five minute lightning talk - Simon Willison – I wish I had thought about this first!
Gently down the stream - Mitch Seymour
Turning the database inside-out with Apache Samza
The Snowflake Elastic Data Warehouse
Differential Dataflow – also see the Naiad paper
Time, Clocks, and the Ordering of Events in a Distributed System - Leslie Lamport
How Query Engines Work
Building a cost-effective analytics stack with Modal, dlt, and dbt – prime example of what a modern analytics stack looks like in late 2024.
Should Your Data Warehouse Have an SLA?
Dimensional Modeling Techniques - Kimball Group
The power of interning: making a time series database 2000x smaller in Rust - Guillaume Endignoux – this guy takes the git scraping pattern really far, I like his taste.
Functional Data Engineering — a modern paradigm for batch data processing – I strongly believe in this approach.

Inspiring data analysis

Bayesian Rock Climbing Rankings - Ethan Rosenthal
Is Seattle Really Seeing an Uptick In Cycling? - Jake VanderPlas
How we changed our roof and cut 1.5 tons of CO2e - Martin Daniel
WWW: Who Will Win? - Peter Norvig
Wealth shown to scale - Matt Korostoff
Are Pop Lyrics Getting More Repetitive? - Colin Morris
Tracking the Fake GitHub Star Black Market - Fraser Marlow, Yuhan Luo, Alana Glassco
Why the super rich are inevitable - The Pudding – Really cool dataviz.
Kaggle contest on Observing Dark World - Cam Davidson-Pilon – If you’re doubtful about the power of Bayesian machine learning, then read this and get mindblown.
looria.com/reddit – This is a website that aggregates informal product reviews found on Reddit. There’s a bunch of cool NLP stuff going on behind the scenes. For instance, here are recommendations for cycling and camping gear.
Who is the average nomad? – fed by live NomadList data.
Every Noise at Once – uses PCA to map music genres.
How Big is YouTube? - Ethan Zuckerman
NYC Taxi Rides viz
Mario meets Pareto - Antoine Mayerowitz
We mapped weather forecast accuracy across the U.S. Look up your city
Resurfacing the past - a madlad decides to pinpoint all the ships that sank during WWII.
The closer to the train station, the worse the kebab - James Pae
Winners of the $10,000 ISBN visualization bounty - Anna’s Blog

Sustainability

The Limits to Growth - Donella Meadows – it’s not very often that a paper is so accurate in its predictions.
Consumer Hardware Carbon Reduction Guide - Google
The LCA paradox - Frida Røyne
Scope 3 Data in LCA of organisations Between Simplification, Overwhelming & Greenwashing
Climate TRACE
Can the economy become fossil free? - Jean-Marc Jancovici
Forget Shorter Showers
Conditional Optimism: Economic Perspectives on Deep Decarbonization - Michael Grubb
Climate Change: a practical guide

Data sources

API Rank
Finding Undocumented APIs
bigquery-public-data
fh-bigquery
Wikidata Query Service
New York City transport data
Reverse Engineering Bumble’s API – a fun/scary API reverse engineering example that worked in 2020
ccxt – access cryptocurrency exchanges’ APIs
Our World in Data
Beyond the route: Introducing granular MTA bus speed data
csvbase

Data visualization

Datawrapper – great way to produce professional-looking charts and tables.
SlidesCodeHighlighter
How to use a histogram as a legend in {ggplot2}

Food for thought

SQL

The Best Medium-Hard Data Analyst SQL Interview Questions – There are some great interactive SQL tutorials out there, such as SQLBolt and Select Star SQL, but this one takes the cake due to its complexity. The Ultimate SQL guide is a comprehensive guide made with Count.
Bypassing airport security via SQL injection – A ~~fun~~ dangerous example of what can happen when you don’t sanitize your inputs.
SQLook
SQL Noir – this hits the sweet spot between two things I enjoy. Try listening to Bohren & Der Club Of Gore while doing the exercises.

Programming

Writing

Common Bugs in Writing
Novelist Cormac McCarthy’s tips on how to write a great science paper - Savage and Yeh
How to Build an Economic Model in Your Spare Time - Hal R. Varian – The academic wisdom in this article goes beyond the world of economics.
The Double-Entry Counting Method – Great example of documenting a technical concept.
Technical discussions are hard; a few tips
Octavia Butler’s Advice on Writing

Web development

Visual design rules you can safely follow every time - Anthony Hobday – Good follow-up to Web Design in 4 minutes by Jeremy Thomas.
Typography in ten minutes
alpine.js – I usually go to Vue.js for web dev, but my brother made me realize alpine.js is a great alternative for small projects.
Hot Page – looks like a good way to create a landing page.
uchū – decent default color palettes.
Val Town

Building a product

Beautiful Polished Rocks - Steve Jobs – the best metaphor for product design I’ve ever heard.
Stevey’s Google Platforms Rant – insights about product design at GAFAs.
Jeff Bezos on the disagree and commit principle
The Duolingo Handbook

Language models

The surprising ease and effectiveness of AI in a loop

I don’t have a clue but it looks cool

Eye candy

Tyler Hobbs – The god of generative art.
Some Jean Giraud stuff
Mauro Martins
A new way to knit by Petros Vrellis
A fascinating article about Manolo Gamboa Naon
Some Ukiyo-e
Turtletoy
Dwitter
generated.space
Pixel art by Marcus Blättermann
Nick Barnes’ football bible
Simon Stålenhag
Syd Mead (who worked on Blade Runner)
Michael Fogleman’s blog
World of Warcraft art by Dreamwalker
Hors-sol by AKOREACRO
Erica Anderson
Jack Sharp
Archillect – An AI that curates cool pictures, how awesome is that?
Martin Kleppe
Zoomquilt
lossfunctions.tumblr.com – Yes, that’s a thing.
Shirts of Peter Norvig
United Airlines ads by Cream Electric Art
Miniature Calendar by Tatsuya Tanaka – Broccolis that look like trees, staples that look like workout benches… I love it!
sandspiel
Jorge Jacinto
WaveFunctionCollapse
Owen D. Pomery
19th Century French Artists Predicted The World Of The Future In This Series Of Postcards
Blog maps
Decktwo
eycndy.com
Fred’s ImageMagick Scripts
Ditherpunk - Surma
Visually stunning math concepts which are easy to explain - StackExchange
Cars, bars and burger joints: William Eggleston’s iconic America – in pictures
Spectrolite
RamenHaus
SportsNetUSA.net
readcomiconline
MUBI
La vida en viñetas
Plotting 3 years of hourly data in 150ms
What I’ve learned about flow fields so far
Dear Data
FAA Aviation Maps
Floor796
John Martin
marimekko.com
Wacław Szpakowski’s rhythmical lines
10,946: a Year-Long Post-It Note Animation
Mystical
Teenage Artist Virginia Frances Sterrett’s Hauntingly Beautiful Century-Old Dreamscapes for French Fairy Tales

Pretty websites

I like these retrocool websites:

learntarot.com
hyperphysics.phy-astr.gsu.edu
norsys.com
joerick.me – the legend to which we owe cibuildwheel

Cool

WindowSwap
Radio Garden
Every Noise at Once
Starlink Satellites Tracker
Based Cooking
ReadComicOnline – I recommend these French comics.
Same Energy
BOOOOOOM
indieblog.page
Cloudhiker
Fish doorbells! Historic sandwiches! 50 of the weirdest, most wonderful corners of the web
Marginalia
Anna’s Archive
Browser games – these are made by a single doujin developer called Kenta Cho
Pong wars
Hundred Rabbits
Internet Phone Book
Cameron’s World

Documentaries