Coding with AI Agents
Coding with AI agents is here to stay. I've been using them for a while and wanted to share where they work well, where they still fall short, and how to get the most out of them.
Intro
AI agents are excellent at reproducing the syntax and grammar of code. They can quickly write things that look like code, and unlike with human language, they can run it to confirm that it produces the desired effect. This immediate feedback loop has allowed the optimization of code generation in ways impossible with human language generation. While there are debates about whether AI agents truly understand code in the abstract[1], it is obvious that they can quickly write code that suffices for a wide variety of prompts.
The problem with agents is that they often act like a junior dev who hasn't bothered to read the rest of your code or follow your company's standards; they just want to get the work done. Worse, they lack the context you could assume of even a junior hire: no access to your codebase history, code reviews, internal documentation, or lunch table conversations. They are also wont to take shortcuts: I have caught them editing my linting rules so that they don't have to follow them, catching an import exception on a library and silently importing something else, skipping tests they don't like, and claiming they don't need to deal with pre-existing failures they were originally tasked with fixing.
As such, I avoided AI-written code for a long time. It was great for exploring, but the code quality was often so poor that it took me more time to fix it than it did to write things from scratch. However, in the past six months with the advances in the latest models, improvements in how agents can now interact with code, and my development of a deeper understanding of how to harness agents, I've found myself now writing the majority of my code via agents. I expect them to continue to improve, but wanted to provide my thoughts on agents as I see them now and give advice on how you can best take advantage of their skills while minimizing their downsides.
AI agent abilities
Modern flagship AI agents are excellent at well-scoped, repetitive tasks for which they have large amounts of training data and context. They flawlessly recall simple things, like Harry Potter's middle name (James), the syntax of Python's concurrent.futures.ProcessPoolExecutor functions, or the equations for spherical Bessel functions. They can easily perform simple tasks like renaming variables, moving code, updating imports, or writing docstrings without the need for careful direction (e.g. "move files save_* to the save/ folder and drop save from the name", and the agent will also update all files that reference them). They can also perform more complicated tasks like writing regular expressions, making code more idiomatic, or running tests and updating values without additional human involvement.
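As a concrete example of the kind of boilerplate agents recall reliably, here is a minimal ProcessPoolExecutor sketch (the square function is purely illustrative):

```python
from concurrent.futures import ProcessPoolExecutor


def square(x: int) -> int:
    """Trivial CPU-bound work to farm out to worker processes."""
    return x * x


def main() -> list[int]:
    # map() distributes the inputs across worker processes and
    # returns the results in input order.
    with ProcessPoolExecutor(max_workers=4) as pool:
        return list(pool.map(square, range(5)))


if __name__ == "__main__":
    print(main())  # [0, 1, 4, 9, 16]
```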
Agents are also excellent when working in someone else's codebase. Sometimes I come across open-source code that I need help understanding or that needs significant improvement. I can ask an agent which functions I should use and what their side effects are, and then use it to write code with those functions. If the code is an open-source Python script, I often run a linter on it and have an agent fix any issues that it flags and add type hints to make understanding easier (there are also traditional programs that can generate type hints, e.g. pyannotate and pyrefly infer).
More complicated tasks require greater skill on the part of the user. Prompting an agent to "build a website" gives it nothing to work with: a static site generator is the wrong choice for a banking platform, and Drupal is overkill for a blog[2]. For complicated tasks, the user needs to think carefully about the end goal and how it should be achieved before prompting the agent, otherwise it will go off in the wrong direction.
Personal experience
I use small AI agents for trivial tasks: locating things in a codebase, searching my config files, reminding me of obscure vim commands. Whatever free model is available on OpenCode often suffices. This is vastly superior to web-based chatbots, since the agent has access to all of my config files (warning: unless you run an agent completely locally, assume the model provider effectively knows everything on your computer).
For writing code I use the leading models. Unlike searching or explaining, code generation requires significant upfront work to point the agent in the right direction and establish norms around libraries, code style, testing, and documentation. When I skip that, the agent tends to go down rabbit trails and produce a big ball of mud that fails at what I actually wanted.
One of my first successful end-to-end AI coding experiences was the production of graphics using the vega-altair library. I'm a long-time matplotlib user (e.g. my spectra plotting library), but I have many gripes with matplotlib's OOP style and mutable state. I wanted to switch to a more performant and declarative plotting library and chose vega-altair, but it has a completely different syntax from matplotlib. While it is very readable, I didn't want to have to memorize a new syntax just to make some graphics. I'd previously just copied and pasted code between my editor and my favorite chatbot, but this was ergonomically frustrating and didn't scale well beyond scripts. Switching to an agent to write the code and then personally reading the edits made my transition to vega-altair easy. I didn't need to know the magic incantations to adjust the legend, change the stroke style, or add tickmarks; I could get the agent to write the actual code and view the results, tweaking things as necessary. Once the agent was finished with its edits, I could read through the code to make sure that everything was done as expected (you should always check that the agent didn't accidentally throw out half of your data to make the process more efficient).
This didn't mean I could turn my brain off, point it at my data, and tell it to make graphics. I still had to help it understand what data is important, what correlations need to be plotted, and how to make the graphs look good. The graphs it suggested were often useless, but its intricate knowledge of the syntax (and its ability to write it quickly) was the valuable part. With an agent, I could quickly look through my data and pull out the interesting things (such as in my GPU4PySCF blog), while not worrying about my fat fingers typing the wrong thing or having to look up specific syntax.
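For flavor, the declarative style that made the switch attractive: a vega-altair chart compiles down to a Vega-Lite JSON spec roughly like the following (the data file and field names here are hypothetical, sketched after the spectra-plotting use case above):

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {"url": "spectra.csv"},
  "mark": {"type": "line", "strokeWidth": 2},
  "encoding": {
    "x": {"field": "wavelength", "type": "quantitative", "title": "Wavelength (nm)"},
    "y": {"field": "intensity", "type": "quantitative", "title": "Intensity"},
    "color": {"field": "sample", "type": "nominal"}
  }
}
```

Everything is stated up front as data, which is part of why reading an agent's edits to such code is so much easier than auditing a pile of mutable matplotlib calls.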
There are still things that agents are bad at. I can't stand AI-generated writing. It’s not the clichés, it’s the bromides—in today’s fast-paced world of ever-evolving writing tools, I don’t need to delve deeply into their writing to give a nuanced description of all the nonsense phrasing—it’s just bad, but not for the reason many people think. The writing can be made to sound passable if you outline what you want to say and have the agent fill it out, but you will often spend more time trying to twist the agent’s words back to a semblance of good writing than if you had spent the time writing it yourself the first time, and will have learned less in the process.
Sadly, the vast majority of AI writing I see does not even try that hard. LinkedIn is full of "Thought Leaders" who do little thinking beyond pointing an LLM at an article and asking for a summary "in my voice." Agents are happy to string together words, but without deeper thought behind the writing, the result regresses to empty phrases.
Of course, I do appreciate AI tools for editing my writing and plan to integrate AI agents more tightly into my writing process. Agents can be very helpful for updating spelling and grammar, and even editing when used well (for more, check out this excellent article on AI writing by Jasmine Sun). However, I don’t think you can turn off your brain and get an AI to replace your writing (yet). Additionally, writing is an important part of the thinking process, and replacing thinking in the short term will hamstring you in the long term.
How I use agents effectively
Given that AI agents often make mistakes and take shortcuts, I find it important to enforce strict guidelines on them. I do all of my coding inside of uv- or pixi-managed projects with numerous formatting, linting, and type-checking rules that run on every commit, PR, and merge to master. This enforces a very consistent style on the code and helps agents produce higher-quality code. Many of the errors made by agents can trivially be caught by good linting and type checking. I even have many of these hooks run every time an agent touches a file, giving it immediate feedback when its edits break things and automatically correcting its formatting. If I don't add these tools, it is common for their code to slowly degrade and for agents to take shortcuts that they shouldn't (they love adding noqa to code lines). In my experience, as the codebase grows the agents will tend to mirror the style in the codebase, leading to fewer corrections from the linter (though no less need for vigilance).
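As a sketch of what such guardrails can look like in a pyproject.toml (the table names follow ruff's configuration conventions; the specific rule selection is only an example, not my exact setup):

```toml
[tool.ruff]
line-length = 100
target-version = "py312"

[tool.ruff.lint]
# E/F: pycodestyle and pyflakes, I: import sorting,
# ANN: missing type annotations, PGH: flags blanket "noqa" comments
select = ["E", "F", "I", "ANN", "PGH"]

[tool.ruff.lint.per-file-ignores]
# Tests don't need full annotation coverage
"tests/*" = ["ANN"]
```

Wiring a configuration like this into pre-commit hooks (and into the hooks that fire when the agent edits a file) is what turns it from a suggestion into an enforced standard.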
I also provide agents with as much context as possible about how to write code, how to write docstrings, how to test code, and how to assist me in writing generalized documentation (see How to use skills). The increased context helps keep the agents in line and on the right path. Explicitly telling them what they should do and what is bad code with simple examples quickly improves the quality of the code the agent produces.
I almost always start in planning mode to map out what I want to accomplish. I iterate with the agent to understand the problem and come up with a plan. If it starts to go too far in the wrong direction, I'll clear the context and start over with a better initial prompt. Trying to redirect agents that have gone astray is difficult since the prior context keeps pulling them back toward the wrong path.
Between tasks, I make sure to clear or compact the context as often as possible. AI agents "remember" the entire conversation (i.e. the context window), but the instructions at the beginning of the conversation slowly lose weight as the conversation lengthens. With growing context windows, I've noticed agents become increasingly forgetful of the instructions that I provide surrounding rules and norms. Near the end of a context window, an agent will often "forget" that I told it to avoid placing imports in functions or that I want everything in Ångström, not Bohr. I've even had one forget that it is in planning mode and start editing files.
To work on larger tasks, I explicitly instruct the agent to create plans that are to be executed in stages, with all checks and tests running between stages and marked off in a TODO.md file. This allows easier context clearing between stages and recovery from failures. I see the tension between growing context windows and instruction forgetting as the biggest problem of the next year (I won't pontificate on the best way to fix it, but I expect to see increasing use of sub-agents and sub-sub-agents that will report their compacted findings up the chain).
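A staged plan file of the sort described might look like the following (the stages themselves are hypothetical):

```markdown
# TODO.md -- staged plan; items are checked off and context is cleared between stages

## Stage 1: data model
- [x] Define dataclasses for Molecule and Spectrum
- [x] Run formatters, linters, and type checks; all tests pass

## Stage 2: parsers
- [ ] Implement the XYZ file parser
- [ ] Add unit tests against known-good fixtures
- [ ] Run the full check suite before marking this stage done
```

Because the file lives on disk rather than in the conversation, a fresh agent can pick up exactly where the last one left off after a context clear.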
At the start of projects, I now spend significantly more time thinking about what I want to accomplish and writing it down. For new methods I want to implement, I'll iterate with the agent to perform research, condense that into a methodology document, have the agent write a spec from it, and then write the code. I craft skills for each of these stages (e.g. how to perform research, how to write $\LaTeX$, how to write tests), loading them in as needed to avoid bloating the context window with irrelevant instructions.
How to use skills
Agent skills let you give the agent fine-grained instructions when it needs them, without filling up its context with irrelevant commands. They can give directions on the structure of your repository, your preferred packages, how to perform research, how to document code, etc. Since Anthropic (maker of Claude) pioneered skills, they are usually stored in .claude/skills/*/SKILL.md. I've written a few for my cookiecutters, including:
- /write-code - instructions on how to write code; includes directives like use @dataclass(slots=True) and other best practices, as well as tips on formatting code
- /write-docstrings - instructions on how to write high-quality docstrings
- /write-method-docs - instructions on how to write independent method documentation that covers the math and algorithms behind methodologies
- /write-tests - instructions on how to write useful tests (and how not to write a bunch of useless tests)
I highly recommend adapting these and making them your own[3] (see Writing agent skills); there is a lot of benefit in thinking deeply about how code should be written and organized.
How to run AI agents
I'm a Linux and Neovim guy who mostly uses a Framework laptop,[4] so I almost exclusively run things via the CLI. Both the Claude and OpenCode CLI interfaces are great, but they are slightly different. OpenCode has a nice sidebar keeping track of context usage, tools used, task progress, edited files, and your location. Claude has a slightly simpler interface, but does a better job walking you through questions to update your plan. However, these observations are likely to be outdated soon, so I recommend regularly playing around with multiple options to see what you like. Both CLIs use slash commands (e.g. /models) to help you control the agent, swap models, ask questions, query usage, use skills, etc. I typically have a terminal open with the CLI tool of choice and another terminal open so I can poke around at what the agent is doing (though I need to do that less often nowadays).
Others run agents through their IDE. This makes it easy to both view the code while the agent edits it or immediately load the file (or section of file) into the agent's context for work/debugging. I've started experimenting with this in Neovim, but often prefer the pure-cli application.
Nominally, Claude models use CLAUDE.md and .claude/ for managing the knowledge and features of the model, while other models use AGENTS.md; functionally, every agent reads every other agent's stuff, and it is unclear exactly how much they share knowledge and features (considering that Claude skills are only 6 months old, it's understandable that there isn't yet standardization).
Which AI agents I use
Ignore this section if you are reading it more than three months after publication; it will probably be woefully out of date.
The leading LLMs are run on multiple industry-grade GPUs at a time, and I haven't had much success running smaller models on my laptop with only 64 GB of RAM, so I subscribe to numerous model providers for various purposes. For agentic coding, I use the following based on where I have available credits:
- Claude (Anthropic)
- Opus 4.6 - planning code
- Sonnet 4.6 - implementing code (my favorite model currently)
- Haiku 4.6 - implementing simple things or searching through stuff
- Gemini (Google)
- Gemini 3.1 Pro - everything
- GPT (OpenAI)
- GPT 5.4 - everything except writing documentation, but I don't like the code it produces as much and I have strong distaste for GPT's writing style (perhaps because I see it everywhere on LinkedIn)
- Others
- Big Pickle (current free model on OpenCode Zen) - implementing repetitive simple things and searching through stuff like my Neovim config
Currently, my preferred setup is to use Opus 4.6 for planning and Sonnet 4.6 for coding, as Opus is better at reasoning while Sonnet is faster and cheaper. I'm subscribed to the Claude Pro plan, but I constantly hit my 5-hour and weekly limits, so I often switch to other providers. GitHub has a large number of models available, and provides all credits at the beginning of the month. Google often pauses Gemini usage due to systems "overheating", making it frustrating to use, but I'm content with the code quality. OpenCode sometimes randomly freezes, but I haven't been able to figure out why. Claude-cli often asks for overly fine-grained permissions, typically in its pursuit of not putting too much information into the context window.
How to get started
A great way to get started is to have the agent implement something you know well and of tractable size (e.g. build a method to generate molecular conformers with RDKit and then optimize with xtb). I've added agent skills to my uv and pixi cookiecutters to help you get started. Spend some time thinking about what you want to implement and write a few paragraphs of details. Enter planning mode and ask the agent what it thinks of your plan and how it would improve it. Iterate for a few rounds until its plan aligns with your goals. Then clear the context and let it implement. Once done, read through the implementation and note where it did things you didn't want.
This will clarify where the agent falls short and how to direct it better. For recurring preferences, add sections to your code-writing skill (e.g. "use polars instead of pandas", "don't use matplotlib"). Then reuse those skills on a fresh implementation of the same task and observe how the agent's patterns change. As you improve, you'll spend more time refining your skills and less time correcting the agent mid-task.
Writing agent skills
The skills in my cookiecutters are rudimentary. Writing your own will give you deeper insight into how agents "think" and how to express your goals clearly. Before starting a project, spend time optimizing the skills for what you want to accomplish; specifying preferred libraries, formatting, and success criteria goes a long way toward getting useful code without constantly fighting the agent. When I find it going off track, I'll often update my skills and restart the task.
SKILL.md files consist of a metadata section, which indicates to the agent when to use the skill, and a body, which contains the content of the skill. Anthropic has a guide to writing skills (simpler pdf version) that is well worth a read. Your agent has also read it and is the best resource for helping you develop better skills. Enter planning mode and discuss what you hope to accomplish and how you want to encourage your agent to do so. Then have it make a plan to build skills with you.
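A minimal SKILL.md following that shape might look like this (the YAML frontmatter is the metadata section the agent uses to decide when to load the skill; the body content here is illustrative, not one of my actual skills):

```markdown
---
name: write-docstrings
description: Use when writing or editing Python docstrings in this repository.
---

# Writing docstrings

- Use the modified Google style: a one-line summary, then Args/Returns/Raises.
- Document units explicitly (e.g. distances in Ångström).
- Do not restate type hints in the docstring; the signature already has them.
```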
When writing skills, expressing general principles matters more than enumerating every good and bad combination. For my /write-docstrings skill, I specify a modified Google docstring style and provide one correct and one incorrect example. From that, the agent does a reasonable job staying within my preferred style.
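As a sketch of the kind of single correct example that works well inside such a skill (the function itself is hypothetical, not taken from my skill files):

```python
def bond_length(
    r1: tuple[float, float, float], r2: tuple[float, float, float]
) -> float:
    """Compute the distance between two atomic positions.

    Args:
        r1: Cartesian coordinates of the first atom, in Ångström.
        r2: Cartesian coordinates of the second atom, in Ångström.

    Returns:
        The Euclidean distance between the two atoms, in Ångström.
    """
    # Plain Euclidean norm of the displacement vector.
    return sum((a - b) ** 2 for a, b in zip(r1, r2)) ** 0.5
```

One example like this, paired with one counterexample of what not to do, communicates the style far more compactly than a long list of rules.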
You can bundle content with a skill, such as reference documentation, code structure, algorithm details, and papers that get loaded as needed, though I often find a basic SKILL.md is adequate. More content in a skill consumes more of the context window, and repeating the same point in multiple ways doesn't give it more staying power. Skills should be brief so they can be loaded quickly without flooding the context window with superfluous information.
Conclusion
Coding with AI agents is fundamentally reshaping the discipline of software development. It is a significant boon to already-effective coders, leveraging their understanding of code design and accelerating the pace at which code can be written and rewritten. For unskilled programmers or those unfamiliar with a library or framework, it can also be a great help, enabling them to work with things they have never used before, but caution should be exercised.
So far, the true benefits of AI agents have accrued to those who already have coding expertise; the promise of an untrained programmer giving an agent a simple prompt and having it build complex infrastructure has not yet materialized. Experts can now ship much faster without needing to worry about the minutiae of syntax; agentic coding does not yet free one from needing to understand the code that the agent has written, merely from the need to write it. Agents also enable quick iteration on coding plans, surfacing obscure sections of code and how they will need to change to accommodate the goal; I've done refactors in ten minutes that I previously wouldn't have considered even with a whole day free. Agentic coding is not an invitation to turn your brain off and have it do your work for you, but to think more deeply about the larger problems at hand without getting mired in syntax. So far, agentic coding is like the switch from handwriting to typing on a computer; you still need to provide the content and thought, you can just go more quickly without needing to hand-craft the letters.
Sadly, I expect many of the coming AI-written codebases to be giant balls of mud. Unless care is taken to maintain a well-defined structure to the code in the face of the deluge of new features, many codebases will quickly start to drown. Shoddily implemented features will bury us, while ironically becoming load-bearing for random other parts of our projects. Yet perhaps in our code's potential devastation also lies an opportunity; agents can also be wielded to help bail us out of this coming mess with constant refactoring and code-quality improvement, but only if we take care to use them well.
[1] e.g. this paper showing their failure when coding in esoteric languages they haven't been optimized for: arxiv.org/abs/2603.09678
[2] BTW, I use Zola
[3] One should never blindly run skills from others; that is a perfect attack vector, and I expect we will see some very embarrassing things happen in this regard in the near future.
[4] I also grudgingly own a Mac because they have incredible battery life, the best processors, great build quality, good speakers, etc.