Nicolay here,
I think by now we are done marveling at the latest benchmark scores. It doesn't tell us much anymore that the newest generation outscores the previous one by a few percentage points.
If you don’t know how the LLM performs on your task, you are just duct-taping LLMs into your systems.
If your LLM-powered app can’t survive a malformed emoji, you’re shipping liability, not software.
Today, I sat down with Vaibhav (co-founder of Boundary) to dissect BAML—a DSL that treats every LLM call as a typed function.
It’s like swapping duct-taped Python scripts for a purpose-built compiler.
Vaibhav advocates for building primitives from first principles.
One principle stood out: LLMs are just functions; build like that from day one. Wrap them, test them, and loop in a human only where it counts.
Once you adopt that frame, reliability patterns fall into place: fallback heuristics, model swaps, classifiers—same playbook we already use for flaky APIs.
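To make that frame concrete, here is a minimal sketch in plain Python (not BAML syntax; the Order type, the retry budget, and the stub model are all invented for illustration): one typed function wrapping the model call, and the flaky-API playbook layered on top.

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class Order:
    order_id: str
    refund_requested: bool

def extract_order(call_model: Callable[[str], str], text: str) -> Order:
    """One LLM call as f(input) -> Order: prompt, parse, validate, or raise."""
    raw = call_model(f"Return order_id and refund_requested as JSON for: {text}")
    data = json.loads(raw)  # malformed output raises here
    return Order(str(data["order_id"]), bool(data["refund_requested"]))

def extract_order_reliably(models: list[Callable[[str], str]], text: str) -> Order:
    """The flaky-API playbook: retry each model, then fall back down the list."""
    for call_model in models:
        for _ in range(2):  # small retry budget per model
            try:
                return extract_order(call_model, text)
            except (ValueError, KeyError):
                continue  # bad output: retry, then fall back
    raise RuntimeError("all models failed; route to a human")

# Usage with a stub standing in for a real client:
fake_model = lambda prompt: '{"order_id": "A17", "refund_requested": true}'
print(extract_order_reliably([fake_model], "Please refund order A17."))
```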
We also cover:
- Why JSON constraints are the wrong hammer—and how Schema-Aligned Parsing fixes it
- Whether “durable” should be a first-class keyword (think async/await for crash-safety)
- Shipping multi-language AI pipelines without forcing a Python microservice
- Token-bloat surgery, symbol tuning, and the myth of magic prompts
- How to keep humans sharp when 98% of agent outputs are already correct
💡 Core Concepts
- Schema-Aligned Parsing (SAP)
- Parse first, panic later. The model can emit Markdown, half-baked YAML, or rogue quotes; SAP coerces the output into your declared type or raises. No silent corruption. (See the sketch after this list.)
- Symbol Tuning
- Labels eat up tokens and often don’t improve accuracy (in some cases they even hurt it). Rename PasswordReset to C7 and keep the description human-readable. (Sketched after this list.)
- Durable Execution
- A computing paradigm where program execution state persists across failures, interruptions, and crashes, so operations resume exactly where they left off even when systems go down. (A toy sketch follows the list.)
- Prompt Compression
- Every extra token is latency, cost, and entropy. Axe filler words until the prompt reads like assembly. If output degrades, you cut too deep—back off one line.
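To illustrate the SAP idea: a toy sketch in Python, far simpler than BAML's real parser (the regex extraction and the handful of coercion rules are my own illustration). The shape is the point: dig a structure out of messy output, deterministically coerce it into the declared types, and raise on anything that doesn't fit.

```python
import json
import re
from typing import Any

def coerce(value: Any, target: type) -> Any:
    """Deterministically coerce a parsed value into the declared type, or raise."""
    if target is list:
        return value if isinstance(value, list) else [value]  # scalar -> list
    if target is int:
        if isinstance(value, list) and len(value) == 1:
            value = value[0]  # [7] -> 7
        return int(value)  # "3" -> 3; raises ValueError on a bad cast
    if target is str:
        return str(value)
    raise TypeError(f"no coercion rule for {target}")

def parse_aligned(raw: str, schema: dict[str, type]) -> dict[str, Any]:
    """Dig a JSON-ish object out of the model's chatter, then coerce per schema."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # ignore the prose around it
    if not match:
        raise ValueError("no object found in model output")
    data = json.loads(match.group())
    return {key: coerce(data[key], typ) for key, typ in schema.items()}

# Tolerates surrounding chatter and a scalar where a list was declared:
raw = 'Here you go:\n{"ids": 7, "count": "3"}\nHope that helps!'
print(parse_aligned(raw, {"ids": list, "count": int}))  # {'ids': [7], 'count': 3}
```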
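And a sketch of symbol tuning in plain Python (the category codes and descriptions are invented): the prompt only ever carries the short symbols plus human-readable descriptions, and the mapping back to real labels happens in code after the call.

```python
SYMBOLS = {  # short codes stand in for verbose labels in the prompt
    "C0": "PasswordReset",
    "C1": "BillingDispute",
    "C2": "FeatureRequest",
}
DESCRIPTIONS = {
    "C0": "user cannot log in or wants a new password",
    "C1": "user disputes a charge or an invoice",
    "C2": "user asks for functionality we don't have",
}

def build_prompt(ticket: str) -> str:
    options = "\n".join(f"{sym}: {desc}" for sym, desc in DESCRIPTIONS.items())
    return f"Classify the ticket as exactly one code.\n{options}\nTicket: {ticket}\nCode:"

def decode(model_output: str) -> str:
    code = model_output.strip()
    if code not in SYMBOLS:
        raise ValueError(f"unexpected code: {code!r}")
    return SYMBOLS[code]  # map back to the human-readable label in code

print(build_prompt("I was charged twice this month."))
print(decode("C1"))  # -> BillingDispute
```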
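Finally, a toy sketch of the durable-execution idea, with a JSON file standing in for a real durability layer (the durable_step helper is hypothetical, not BAML's planned API): completed steps are checkpointed, so a crash-and-restart resumes instead of re-running.

```python
import json
import os

def durable_step(name: str, fn, store: str = "checkpoints.json"):
    """Run fn at most once: persist the result so a restart skips finished steps."""
    state = json.load(open(store)) if os.path.exists(store) else {}
    if name in state:
        return state[name]  # finished before the crash; don't re-run
    result = fn()
    state[name] = result
    with open(store, "w") as f:
        json.dump(state, f)  # checkpoint before moving on
    return result

# A two-step workflow: if the process dies after the first step, rerunning
# the script resumes at the second instead of paying for the call again.
summary = durable_step("summarize", lambda: "three-line summary")  # imagine an LLM call
verdict = durable_step("classify", lambda: "approve")              # and another one
print(summary, verdict)
```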
📶 Connect with Vaibhav:
📶 Connect with Nicolay:
⏱️ Important Moments
- New DSL vs. Python Glue [00:54]
- Why bolting yet another microservice onto your stack is cowardice; BAML compiles instead of copies.
- Three-Nines on Flaky Models [04:27]
- Designing retries, fallbacks, and human overrides when GPT eats dirt 5% of the time.
- Native Go SDK & OpenAPI Fatigue [06:32]
- Killing thousand-line generated clients; typing `go get` instead.
- “LLM = Pure Function” Mental Model [15:58]
- Replace mysticism with f(input) → output; unit-test like any other function.
- Tool-Calling as a Switch Statement [18:19]
- Multi-tool orchestration boils down to switch(action) {…}; no cosmic “agent” needed. (Sketched below, after this list.)
- Sneak Peek—durable Keyword [24:49]
- Crash-safe workflows without shoving state into S3 and praying.
- Symbol Tuning Demo [31:35]
- Swapping verbose labels for C0, C1 slashes token cost and bias in one shot.
- Inside SAP Coercion Logic [47:31]
- Single-int arrays to ints, scalars to lists, and bad casts raise: deterministic, no LLM in the loop.
- Frameworks vs. Primitives Rant [52:32]
- Why BAML ships primitives and leaves the “batteries” to you—less magic, more control.
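To ground the switch-statement point from [18:19], a minimal Python 3.10 sketch (the tool names, the JSON action format, and the in-memory db are all invented): the model picks an action, and plain control flow routes it.

```python
import json

def run_tool_call(model_output: str, db: dict) -> str:
    """Route a model-chosen action with plain control flow; no agent framework."""
    call = json.loads(model_output)  # e.g. {"action": "lookup", "args": {...}}
    match call["action"]:
        case "lookup":
            return str(db.get(call["args"]["key"], "not found"))
        case "write":
            db[call["args"]["key"]] = call["args"]["value"]
            return "ok"
        case "finish":
            return call["args"]["answer"]
        case _:
            raise ValueError(f"unknown tool: {call['action']}")

db: dict = {}
print(run_tool_call('{"action": "write", "args": {"key": "plan", "value": "ship"}}', db))
print(run_tool_call('{"action": "lookup", "args": {"key": "plan"}}', db))  # -> ship
```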
🛠️ Tools & Tech Mentioned
📚 Recommended Resources
🔮 What's Next
Next week, we continue digging into getting generative AI into production, talking with Paul Iusztin.
💬 Join The Conversation
I will be opening a Discord soon to get you guys more involved in the episodes! Stay tuned for that.
♻️ Here's the deal: I'm committed to bringing you detailed, practical insights about AI development and implementation. In return, I have two simple requests:
- Hit subscribe right now to help me understand what content resonates with you
- If you found value in this post, share it with one other developer or tech professional who's working with AI
That's our agreement - I deliver actionable AI insights, you help grow this. ♻️