@lain can you tell me a bit about how you go about your agentic coding workflow, or just give me some samples of prompts you'll use to get codex/claude code to work for you, especially for a long period? i don't think i'm being specific enough about what i want accomplished (and how), and i think it would be helpful to hear what you find does and doesn't work to get useful outputs if you wouldn't mind sharing!
@karolat frankly, claude can't work autonomously for that long, even opus 4.6, but gpt 5.2 xhigh can (codex 5.3 can kinda do it as well, but it's not as good). a few big things keep it on-rails for me, in agents.md:
- tell it to do real TDD (red, green, refactor). this removes 80% of "oops i wrote the code but it doesn't do what you want" type of errors.
- tell it to commit early, and commit often in small, topical commits.
- if possible, give it some real-world grounding. if you write against an api, tell it to create a mock of that api, then to automate tests against it. if you write a fediverse server, tell it to run mastodon and pleroma in a docker compose network and test against that. if you write an emulator, tell it to take screenshots after x cycles and look at them.
- after a bunch of changes, tell it to do a review / refactor / security audit run, best when started fresh. it'll find a few things, let it fix those first.
this should get you to a software quality level a few standard deviations higher than the average release.
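the TDD + mock-api points above can be sketched as a minimal python example. everything here (the `display_name` function, the api shape) is made up for illustration, not from any real project: the idea is that the tests against the hand-rolled mock are written first (red), then the function is implemented (green), then refactored with the tests as a safety net.

```python
from unittest.mock import Mock

# hypothetical client code under test: fetch a user via an api client
# and format a display name.
def display_name(api, user_id):
    user = api.get_user(user_id)
    if user is None:
        return "(unknown)"
    return f"{user['name']} (@{user['handle']})"

# the TDD loop: these asserts exist before the implementation (red),
# then display_name is written to make them pass (green).
def test_known_user():
    api = Mock()
    api.get_user.return_value = {"name": "Lain", "handle": "lain"}
    assert display_name(api, 1) == "Lain (@lain)"
    api.get_user.assert_called_once_with(1)

def test_missing_user():
    api = Mock()
    api.get_user.return_value = None
    assert display_name(api, 42) == "(unknown)"

test_known_user()
test_missing_user()
```

the mock doubles as the "real-world grounding" stand-in: once the agent has a fake api it can call, it can automate tests against it instead of guessing what the real service returns.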
@lain thanks, this has been helpful, especially the TDD approach. i'll need to work in that idea of creating mocks of APIs, especially since i'm currently focused more on the QA side of development
using a separate agent to review another agent's work has been helpful too, even when it comes to evaluating a plan. i iterated a few times on a plan until fresh agents were satisfied with it, and then codex just oneshot the app: backend, frontend, db w/ migrations, all working in docker-compose. crazy compared to where claude code was a year ago
i've been using Peter Steinberger's advice (from a recent Lex Fridman episode) of asking the agent, after it completes work, if there's anything it would have done differently, to pick up on pain points
one thing: i recall you saying you were having codex call the claude code cli to implement things in some cases? are you still doing that, and why/how?
@karolat yeah as you said, it's crazy how much progress there has been even in the last few months.
re: calling claude from codex, i mostly use opencode.ai now which can have different subagents for things like review or ux that have different default models, so that's one way to go.
@karolat oh and THINK BIG when it comes to testing. e.g., testing before would be "let's do a few unit tests"; now you can do "write a docker compose setup with those five mock services, simulate these 50 different scenarios". stuff that would have been extremely expensive and annoying without llms.
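a minimal sketch of that "think big" style, assuming a hypothetical users endpoint (the paths and payloads are invented for illustration): one in-process mock service standing in for one of the five, plus a loop that drives many generated scenarios against it over real HTTP.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# tiny in-process mock service; a real setup would run several of these
# (or full containers) in a docker compose network.
class MockAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        user_id = self.path.rsplit("/", 1)[-1]
        if user_id.isdigit():
            body = json.dumps({"id": int(user_id), "name": f"user{user_id}"}).encode()
            self.send_response(200)
        else:
            body = b'{"error": "not found"}'
            self.send_response(404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep scenario runs quiet
        pass

def run_scenarios(base_url, n=50):
    """Drive n generated scenarios against the mock and return the pass count."""
    passed = 0
    for i in range(n):
        with urlopen(f"{base_url}/users/{i}") as resp:
            data = json.loads(resp.read())
            if data["id"] == i and data["name"] == f"user{i}":
                passed += 1
    return passed

server = HTTPServer(("127.0.0.1", 0), MockAPI)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_address[1]}"
assert run_scenarios(base, 50) == 50
server.shutdown()
```

scaling this up (more mock services, messier scenario generators, failure injection) is exactly the kind of tedious scaffolding an agent can write for you.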
@lain what's your breakdown of what model you use for different work? do you find opus better for just UI work and then gpt-5.2 xhigh for everything else?
also, when comparing codex 5.3 to 5.2 xhigh, are you running 5.3 at the xhigh level as well?
@karolat opus is more lyrical, romantic, human in many ways, so if you run a user-facing agent, use opus. it can also do UI much better than codex, so i'd ask it for that. for general programming, by now i tend to use 5.3 codex on high or xhigh (although it's not clear which is better), and then after a few commits i'll get a review round from 5.2 non-codex on xhigh, which always finds a few issues for 5.3 to fix. that, combined with TDD, can pretty much do anything, at least in my use cases (and they've been diverse)