Andrej Karpathy’s autoresearch project made a lot of ripples. I guess it’s exciting because a program that modifies itself feels like actual AI. The concept of autonomous programs is not new – see genetic programming – but what we’re witnessing now is somewhat more convincing. We just got a bit closer to Neuromancer’s world, with a Turing Police that’s in charge of stopping AIs from rewriting themselves.

While we wait for AI to become self-conscious, there are mundane problems that autoresearch can solve today. Web scraping is one of them. Indeed, turning unstructured data into knowledge is an excellent problem for LLMs to solve. But there are nuances to keep in mind:

  • Calling the LLM on each HTML page will cost a lot of tokens if you have many pages to scrape.
  • Letting the LLM spit out a response and calling it a day is lazy. There has to be at least one layer of validation.
  • Web scrapers break because they are static, whereas web pages change over time.

I prototyped something I wanted to share. I’m a regular reader of the weekly quiz Thomas Eaton writes for The Guardian. It’s the right level of difficulty, and the topics interest me. I want to scrape all past questions and answers, both to make myself some Anki spaced-repetition cards and to use NER techniques to dive into the topics.

The catch is that these quiz pages aren’t always in the same format. The HTML changes slightly from time to time, which means that a regular web scraper would regularly break and need fixing. Prompting Claude to extract the questions and answers is powerful, because I can let it worry about the page structure.

There are 10 years’ worth of pages to scrape, and more to come, so I don’t want the agent to start from scratch each time. I therefore ask it to generate a Python script that does the extraction. I end up with an idempotent script, which I can run once a week when there’s a new quiz to scrape:

python scrape/the_guardian_weekly/scrape.py
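What does idempotency look like here? A minimal sketch, where the file layout, field names, and helper functions are my assumptions rather than the actual generated script: the idea is simply to skip any quiz URL that already has an entry in questions.json, so re-running the script changes nothing.

```python
import json
from pathlib import Path

# Hypothetical path; in the real project it would live under scrape/the_guardian_weekly/.
QUESTIONS_FILE = Path("questions.json")

def load_existing() -> list:
    # Previously scraped entries, or an empty list on the first run.
    if QUESTIONS_FILE.exists():
        return json.loads(QUESTIONS_FILE.read_text())
    return []

def scrape_quiz(url: str) -> dict:
    # Placeholder for the real extraction logic the agent generates.
    return {"url": url, "questions": []}

def run(urls: list) -> None:
    entries = load_existing()
    seen = {entry["url"] for entry in entries}
    for url in urls:
        if url in seen:
            continue  # already scraped: skipping makes reruns idempotent
        entries.append(scrape_quiz(url))
    QUESTIONS_FILE.write_text(json.dumps(entries, indent=2))
```

Running this twice over the same list of URLs leaves questions.json unchanged the second time, which is exactly the property that makes a weekly cron-style run safe.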

Again, this script will eventually break. What I want to do is run it once a week, and have it fix itself when there’s an issue. This can be done via Claude’s headless mode:

claude -p "Scrape @scrape/$(filter-out $@,$(MAKECMDGOALS))/scrape.py" \
    --model haiku \
    --append-system-prompt "$$(cat CLAUDE.md)" \
    --allowedTools "Bash,Read,Edit,Write,Grep,Glob" \
    --output-format json \
    --dangerously-skip-permissions \
    --verbose \
    --max-turns 25

Headless mode is cool because it allows running an agentic session without any human in the loop. It’s practical when you have a more or less deterministic task for the agent to perform.
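If you’d rather drive this from Python than from a Makefile, the same invocation can be wrapped with subprocess. This is a sketch: the flag set mirrors the call above, and the shape of the JSON report on stdout is whatever your installed CLI version prints, so treat both as assumptions to verify.

```python
import json
import subprocess

def build_command(source: str) -> list:
    # Mirrors the Makefile invocation above; check flag names against
    # your installed Claude Code CLI version.
    return [
        "claude", "-p", f"Scrape @scrape/{source}/scrape.py",
        "--model", "haiku",
        "--allowedTools", "Bash,Read,Edit,Write,Grep,Glob",
        "--output-format", "json",
        "--dangerously-skip-permissions",
    ]

def run_headless(source: str) -> dict:
    # Run the agentic session with no human in the loop and parse the
    # JSON report printed on stdout (schema varies by CLI version).
    proc = subprocess.run(build_command(source), capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError(f"headless run failed: {proc.stderr.strip()}")
    return json.loads(proc.stdout)
```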

Note that I’ve added some quality control to the system prompt:

After running the script, use jq '.[-1]' to extract and review only the last entry from the source's questions.json. Do not read the entire file. Verify that questions and answers are properly paired and make sense. If the output looks wrong, delete the bad entry from the JSON file, fix scrape.py to handle the page format correctly, and re-run.

This way, whenever the script breaks or outputs bogus questions and answers, the agent takes over and fixes the script.
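The check the system prompt describes is cheap to mirror in plain Python. Here is roughly what the agent’s review amounts to, with the field names and pairing heuristic being my assumptions about the questions.json schema:

```python
import json
from pathlib import Path

def last_entry_looks_sane(path: str = "questions.json") -> bool:
    # Review only the most recent entry, like jq '.[-1]' does,
    # instead of re-reading the whole file.
    entries = json.loads(Path(path).read_text())
    if not entries:
        return False
    latest = entries[-1]
    pairs = latest.get("questions", [])
    # An entry with no questions, or with a question missing its answer,
    # means the scraper mis-parsed the page. (Field names are assumed.)
    return bool(pairs) and all(
        p.get("question") and p.get("answer") for p in pairs
    )
```

If this returns False, the agent’s job is to delete the bad entry, patch scrape.py, and re-run, as instructed in the prompt above.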

If the parsing script runs cleanly, the only change is new rows in the questions.json file, which contains all question/answer pairs parsed so far. If it fails, the scrape.py script gets edited too. In both cases a pull request is opened for me to review once a week. This gives me a second layer of review that only takes a few seconds.

I like this pattern because I have the choice of running the parsing script manually, or wrapping it with a harness – Claude Code in this case. With the harness, the script can “evolve” autonomously, which is one hell of a step forward.