Nicolay here,
most AI conversations obsess over capabilities. This one focuses on constraints - the right ones that make AI actually useful rather than just impressive demos.
Today I have the chance to talk to Dexter Horthy, who recently put out a long piece called "12-Factor Agents".
It’s like the 10 commandments, but for building agents.
One of them is "Contact humans with tool calls": the LLM can call humans for high-stakes decisions or "writes".
The key insight is brutally simple. AI can get to 90% accuracy on most tasks - good enough for spam-like activities but disastrous for anything that requires trust. The solution isn't to wait for models to get smarter; it's to add a human approval layer for critical actions.
Imagine you are writing to a database or sending an email. Each "write" has to be approved by a human. So you post the email in a Slack channel, and in most cases your salespeople will approve. In the other 10%, it's stopped in its tracks and the human can take over. You stop the slop and collect good training data in the meantime.
Dexter’s company is building exactly this: an approval mechanism that lets AI agents send requests to humans before executing.
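To make the idea concrete, here is a minimal sketch of such an approval gate. The names (`request_approval`, `send_email`, the Slack channel) are my own placeholders, not HumanLayer's actual API: the agent drafts the "write", a human signs off in Slack, and only then does the action execute.

```python
# Hypothetical sketch of a human approval gate for agent "writes".
# Function names and the Slack wiring are illustrative, not HumanLayer's SDK.

from dataclasses import dataclass


@dataclass
class Draft:
    recipient: str
    subject: str
    body: str


def request_approval(draft: Draft) -> bool:
    """Post the draft to a Slack channel and block until a human
    approves or rejects it. Stubbed here with a console prompt."""
    print(f"[#sales-approvals] To: {draft.recipient}\n{draft.subject}\n{draft.body}")
    return input("approve? [y/N] ").strip().lower() == "y"


def send_email(draft: Draft) -> None:
    # the actual "write" - only reached after explicit human approval
    print(f"sending email to {draft.recipient}")


def agent_step(draft: Draft) -> None:
    # ~90% of drafts get a quick thumbs-up; the rest are caught here,
    # and the human's rejection or edit doubles as training data.
    if request_approval(draft):
        send_email(draft)
    else:
        print("rejected - handing off to a human")
```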
In the podcast, we also touch on a bunch of other things:
- MCP servers and why (at the moment) they are really just thin clients
- Are we training LLMs toward mediocrity?
- What infrastructure do we need for human-in-the-loop (e.g. DBOS)?
- and more
💡 Core Concepts
- Context Engineering: Crafting the information representation for LLMs - selecting optimal data structures, metadata, and formats to ensure models receive precisely what they need to perform effectively.
- Token Bloat Prevention: Ruthlessly eliminating irrelevant information from context windows to maintain agent focus during complex tasks, preventing the pattern of repeating failed approaches (first sketch after this list).
- Human-in-the-loop Approval Flows: Achieving 99% reliability through a "90% AI + 10% human oversight" framework where agents analyze data and suggest actions but request explicit permission before execution.
- Rubric Engineering: Systematically evaluating AI outputs through dimension-specific scoring criteria to provide precise feedback and identify exceptional results, helping escape the trap of models converging toward mediocrity (second sketch after this list).
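A small illustrative sketch (mine, not from the episode) of the first two concepts: instead of replaying the full chat history, condense the agent's event log into a compact custom format and keep only a one-line note of failures rather than whole tracebacks.

```python
# Illustrative only: condense an agent's event history into a compact
# context block, dropping bloat instead of replaying everything verbatim.

def build_context(events: list[dict], max_events: int = 10) -> str:
    lines = []
    for e in events[-max_events:]:
        if e["type"] == "tool_error":
            # keep a short note of the failure, not the full traceback,
            # so the model doesn't keep retrying the same broken approach
            lines.append(f"<error tool={e['tool']}>{e['message'][:120]}</error>")
        elif e["type"] == "tool_result":
            lines.append(f"<result tool={e['tool']}>{e['summary']}</result>")
        else:
            lines.append(f"<{e['type']}>{e['content']}</{e['type']}>")
    return "\n".join(lines)


events = [
    {"type": "user", "content": "Find churned accounts and draft win-back emails"},
    {"type": "tool_error", "tool": "crm_query", "message": "timeout after 30s ..."},
    {"type": "tool_result", "tool": "crm_query", "summary": "12 churned accounts found"},
]
print(build_context(events))
```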
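And a hedged sketch of rubric engineering: score each output along explicit dimensions so feedback is precise and exceptional results stand out. The dimensions and weights below are placeholders, not a prescription from the episode.

```python
# Hypothetical rubric: dimensions and weights are placeholders.

RUBRIC = {
    "factual_accuracy": 0.4,
    "voice_match": 0.3,     # does it sound like you, not the average of the internet?
    "actionability": 0.3,
}


def score(per_dimension: dict[str, float]) -> float:
    """Weighted score in [0, 1]; per-dimension scores come from a human
    labeler or an LLM judge."""
    return sum(RUBRIC[d] * per_dimension[d] for d in RUBRIC)


candidates = {
    "draft_a": {"factual_accuracy": 0.9, "voice_match": 0.4, "actionability": 0.7},
    "draft_b": {"factual_accuracy": 0.8, "voice_match": 0.9, "actionability": 0.8},
}
best = max(candidates, key=lambda k: score(candidates[k]))
print(best, {k: round(score(v), 2) for k, v in candidates.items()})
```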
📶 Connect with Dexter:
📶 Connect with Nicolay:
⏱️ Important Moments
- MCP Servers as Clients: [03:07] Dexter explains why what many call "MCP servers" actually function more like clients once you look at the underlying code.
- Authentication Challenges: [04:45] The discussion shifts to how authentication should be handled in MCP implementations and whether it belongs in the protocol.
- Asynchronous Agent Execution: [08:18] Exploring how to handle agents that need to pause for human input without wasting tokens on continuous polling.
- Token Bloat Prevention: [14:41] Strategies for keeping context windows focused and efficient, moving beyond standard chat formats.
- Generating Options vs. Deterministic Outputs: [19:44] The unexplored potential of having AI generate diverse creative options for human selection.
- Fine-tuning vs. RAG for Writing Style: [20:05] Contrasting personal writing style fine-tuning versus context window examples.
- Context Engineering: [29:06] The concept that everything in AI agent development ultimately comes down to effective context engineering.
- Data Labeling Interfaces: [35:25] Discussion about the need for better, lower-friction interfaces to collect human feedback on AI outputs.
- The "Mediocrity Convergence" Question: [37:11] The philosophical concern that popular LLMs may inevitably trend toward average quality.
- Human-in-the-loop Approval Flows: [42:46] The core approach of HumanLayer, allowing agents to ask permission before taking action.
🛠️ Tools & Tech Mentioned
📚 Recommended Resources
🔮 What's Next
Next week, we continue digging into getting generative AI into production, talking to Vibhav from BAML.
💬 Join The Conversation
I will be opening a Discord soon to get you guys more involved in the episodes! Stay tuned for that.
♻️ I am trying to build the new platform for engineers to share the experience they have earned building and deploying stuff in production. I am trying to produce the best content possible - informative, actionable, and engaging. I'm asking for two things: hit subscribe now to show me what content you like (so I can do more of it), and if this episode helped you, pay it forward by sharing it with one engineer who's facing similar challenges. That's the agreement - I deliver practical value, you help grow this resource for everyone. ♻️