Welcome back to How AI is Built. Today, we host Dexter Horthy in a therapy session for anyone who is drowning in 30-step tool chains. Dexter is the founder of HumanLayer and recently wrote a piece on the 12-factor agents, which reads like the ten commandments for building agents. We talk about why MCP servers are really skinny clients and why that's okay, why human in the loop is the key to enabling agents to write instead of just read, so agents that can act but still have to ask first, which is what Dexter is building at HumanLayer, and also how to kill token bloat in agents.
Nicolay Gerold:Let's do it. I have been doing way more with MCP just based on your comment, and it's really interesting. Also, like, what you mentioned with the road map, that it hasn't, like, evolved in any way. Like, there's nothing.
Dexter Horthy:The road map for MCP?
Nicolay Gerold:Yeah. But they haven't adjusted it. Nothing new.
Dexter Horthy:It's yeah. It's interesting when you go and try to build a protocol. The idea is, like, we figured this out and this is gonna be how it is for a while and that's why we built a whole protocol around it. But I don't know. The last couple days, there has been a bunch of people showing up on Twitter and x and stuff and saying, like, oh, this is this is not the protocol I would have built.
Dexter Horthy:To paraphrase, some people said some much more aggressive stuff. But I finally played with it late Friday night, like, beyond just using MCP servers: building a program, a CLI, that actually implements the protocol and calls and discovers servers. And, I don't know. The libraries that people have built around it are cool. I think it works. What have you been messing around with?
Nicolay Gerold:I have been, like, in the end, I've tried most of them. I've looked at, like, how GitHub and Sourcegraph have implemented their stuff as well. And in the end, it's like a client-side API wrapper. That's my interpretation of MCP. And I think, like, this is also how it should remain, because then the actual implementation of, like, the server-calling code has to handle all the authentication.
Dexter Horthy:Yeah. Exactly. Yeah. A lot of people call them MCP MCP servers, but I keep saying, like, they're more like clients when you look at, like, what the code is and what it's doing. The the other thing I've seen a lot of people say that I I kind of identify with is, like, I want MCP but for non LLM code.
Dexter Horthy:Like, I want a simple uniform way to be able to, like, discover and build clients for restful services or whatever it is. Right? Like, I'm building an agent that talks to the Vercel API and most of the code is deterministic. It's not an LLM making the calls. And I wanna be able to you know, like, there is a SDK for it, but it's a little I have to go read the docs and learn it.
Dexter Horthy:If there was an MCP server for it that I could easily call from deterministic code, that would, I feel like that would make a lot of people's lives a lot easier. I think some people are starting to do this. But
Nicolay Gerold:Yeah. I I've seen it yesterday, OpenControl. And I think they are basically building something that wraps your functions in your code base and basically adds, like, the MCP layer around it. And they do it, like, with all the functions. So this is really interesting in my opinion because it makes it easier to adopt it and expose, like, the functions to the LLM.
Nicolay Gerold:But yeah. It's Yeah. I I haven't checked out how they host it yet. So they mentioned, like, it deploys to AWS Lambda workers or containers automatically. So it's by the guy who built SST.
Nicolay Gerold:Yeah. But I I still have to check it out.
Dexter Horthy:Interesting. Okay. The other thing with MCP is, like, I'm wondering, how do I put this? Like, how do you think we should solve auth for that?
Nicolay Gerold:I think, like, MCP shouldn't handle it, to be honest. Like, that's my current assessment, because, like...
Dexter Horthy:it shouldn't be in the protocol. And, like No. If you are able to access the server over standard IO or whatever, then you should be assumed to be authenticated however the client is authenticated.
Nicolay Gerold:Yeah. If you're authenticated, like, you as a user. Because I assume at the moment that MCP will remain a tool that mostly sits on the client side and interacts either with local tools or with tools on the server through a local tool. And so the tool itself should handle all the authentication via API keys that have to be set, or via service accounts or whatever. I don't think it would be, like, that interesting for MCP to implement it, because it would open, like, a can of worms of different schemas of authentication.
Dexter Horthy:Okay. So so I guess here's here's, like, a more concrete example that I'm curious to get your thoughts on. Like, let's say I have an MCP server for, let's let's say the Stripe API. Right? Stripe has an MCP server and I wanna allow my agent to, like, list customers, list products, list users.
Dexter Horthy:Maybe you can even create customers and products, because these things, like, don't quite involve the handling of money. But when it wants to, like, send an invoice or send a payment link, or, like, generate a new payment link or something like that, then I wanna be able to kind of have a human in the loop for parts of that. Right? Or, like, approve it or, like, pause it. This is a thing I think about a lot, because we think a lot about, like, human in the loop for AI that's, like, running out in the world.
Dexter Horthy:But I'm thinking, like, should that come between the LLM and the call to MCP, or do you think there's a world where that lives embedded inside the MCP server, in between kind of, like, the protocol layer and the actual, like, accessing of the upstream tool?
Nicolay Gerold:I would expect, like, on setup, when you're setting up your MCP, you're configuring, like, what does it have access to? Like, what are the different servers? And on setup, like, if one of these tools needs to have, like, API keys, access tokens, whatever, you need to set them. And I think, like, tools should implement the fine grained access themselves how they see fit. So Stripe should allow you to basically generate access token service accounts for the different actions an agent could take or for the different API endpoints.
Dexter Horthy:Okay. So you would be able to create maybe even multiple different MCP servers, or clients, or whatever we're calling them now. And one of them would have your API key that had, like, you know, full write access, and one of them would just have read access, and you control access by, like, which MCPs you actually expose to the LLM in any given context.
Nicolay Gerold:Yeah.
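A minimal sketch of the split they land on here, assuming nothing about any particular MCP client's config schema: the same upstream API is registered twice, once behind a restricted read-only key and once behind a full-access key, and the orchestrating code decides which entry a given agent context ever sees. The command names and key formats below are placeholders.

```python
# Illustrative only: the exact config shape depends on your MCP client.
# The point is that scoping lives in which server entry (and which key)
# you expose to the LLM, not in the protocol itself.
MCP_SERVERS = {
    "stripe-read": {
        "command": "stripe-mcp-server",                   # hypothetical launcher
        "env": {"STRIPE_API_KEY": "rk_readonly_xxx"},     # restricted, read-only key
    },
    "stripe-write": {
        "command": "stripe-mcp-server",
        "env": {"STRIPE_API_KEY": "rk_full_access_xxx"},  # key that can send invoices
    },
}

def servers_for(agent_role: str) -> dict:
    """Only low-trust agent contexts get the read-scoped entry."""
    if agent_role == "reporting":
        return {name: cfg for name, cfg in MCP_SERVERS.items() if name.endswith("-read")}
    return MCP_SERVERS
```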
Dexter Horthy:Okay. Yeah. I like this. How deep have you gone on this idea? Because the other thing I was thinking was, like, if the MCP is gonna hold, like, authorization or permission, or, like, pause to get input before it actually executes a tool, it kinda needs to maintain a little bit of state. And we kind of have to move into this, like, asynchronous world.
Dexter Horthy:Right? Where, like, if the LLM calls a tool and the MCP bounces back with, oh, we're asking for permission for that. Now the orchestrating code has to know to pause until it gets something on a socket or a webhook. Or the LLM has to have some kind of other tool that is like, let me check the status. And it's just gonna sit there checking the status and, like, round-tripping tokens to the LLM and checking the status again until it comes back as completed or approved or rejected or whatever.
Dexter Horthy:But there's this idea with MCP. I'm not advocating for that; I think it's the wrong approach. But there's something there of, like, how do you orchestrate asynchronous operations where you don't want the LLM to just sit there churning through tokens, but you kinda need it to pause for some amount of time. And it may not be, you know, waiting on a human. It may be launching another job or, like, shelling out to some other agent that's gonna go churn tokens for a couple minutes, and you don't want your agent sitting there, like, waiting, listening for an answer kind of thing.
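One way to get the pause Dexter describes without round-tripping tokens is to block in the orchestrating code, not in the LLM. A rough sketch; `request_approval`, `get_approval_status`, and `execute_tool` are stand-ins for whatever channel and tool layer you actually use.

```python
import time

def call_tool_with_approval(tool_name, args, request_approval, get_approval_status, execute_tool):
    """Gate a tool call on human approval; the LLM never sees the waiting."""
    request_id = request_approval(tool_name, args)    # e.g. post to Slack, return an id
    while True:
        status = get_approval_status(request_id)      # poll a store, or swap for a webhook
        if status == "approved":
            return execute_tool(tool_name, args)
        if status == "rejected":
            return {"error": "rejected by human", "tool": tool_name}
        time.sleep(5)                                 # deterministic wait, no tokens burned
```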
Nicolay Gerold:Yeah. I think, like, especially in Windsurf, what they're doing with all the different tool calls: even when one fails, they just continue with, like, the plan they set in the beginning. I would assume, in the task planning, we need to be a little bit smarter, that we have to add metadata, like which tool calls, for example, are required, so we basically retry when one fails. And this would also allow us to get something in play which is more like durable execution, like DBOS, Restate, or Temporal, where you basically maintain the state just in a SQLite database. You have, like, the input state when the function enters and the output state when it exits, you maintain the state through it, and you basically clear the cache at some point, like, whenever a chat is finished or whatever.
Nicolay Gerold:And this should be, like, easy enough to implement because you're storing most of the chats anyway.
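A bare-bones version of that durable-execution idea, assuming nothing beyond the standard library: journal each step's output in SQLite keyed by run and step, so a restarted run replays completed steps from the cache instead of re-calling the tool or the model.

```python
import json
import sqlite3

def open_journal(path="agent_state.db"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS steps "
        "(run_id TEXT, step_key TEXT, output TEXT, PRIMARY KEY (run_id, step_key))"
    )
    return conn

def durable_step(conn, run_id, step_key, fn, *args, **kwargs):
    """Return the cached result if this step already ran; otherwise run it and persist it."""
    row = conn.execute(
        "SELECT output FROM steps WHERE run_id=? AND step_key=?", (run_id, step_key)
    ).fetchone()
    if row:
        return json.loads(row[0])
    result = fn(*args, **kwargs)
    conn.execute("INSERT INTO steps VALUES (?, ?, ?)", (run_id, step_key, json.dumps(result)))
    conn.commit()
    return result
```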
Dexter Horthy:Okay. So you're kind of storing... when you talk about, like, DBOS and Temporal, those ones are really interesting because, like, even something like LangGraph. Right? LangGraph, where you kinda have this: you have your business state, right, which is your context window, your messages object, and that's customizable, but almost everybody I know is just using a stream of OpenAI messages to go back and forth to the LLM and call tools, whatever. And then you have your execution state, which is, like, which graph node are we on?
Dexter Horthy:And, like, which transitions have we done? And that's all, like, checkpointed in a write-ahead thing with SQLite or with the cloud thing or whatever you're using. Right? Like, have you gone inside the LangGraph SQLite schema and, like, the stuff that it stores in there?
Nicolay Gerold:I have never touched LangGraph. I've always rolled my own framework when I did some workflow stuff.
Dexter Horthy:That's fun. I yeah. I'm not I'm I am neither a fan nor a disparager of any AI framework. They're all interesting and they're all solving solving their own problems. But like, this is what I've been doing a lot of lately as well as kind of rolling my own framework because I think in in a lot of ways, code itself is already a graph.
Dexter Horthy:Like, if you're doing if, while, else, whatever, you're already building a DAG. Like, there's a reason we used to represent software programs as little flowcharts, because it's all just a graph anyways. And so if you want the checkpointing and that kind of stuff, and you map your mental model onto a specific programming model like DBOS or Temporal, or I think LangGraph actually falls in that same category of, like, structure your code a certain way and you get these benefits of resumability and interruptibility.
Nicolay Gerold:Yeah. I think, like, the durable... I really hate when you have, like, two different sets of state and they don't really fit into the same schema. So, like, this is something where I would love to see some convergence. Like, that I have, like, one base, or we have some tools which are just, like, an intermediate, a middle layer, which just translates, and everyone just uses the same, like, message schema that we have had for, like, years.
Nicolay Gerold:I got so annoyed when OpenAI started to basically use the developer tag instead of, like, system.
Dexter Horthy:Oh, they have a developer they changed the name of the system message?
Nicolay Gerold:They have a new role. Yeah.
Dexter Horthy:This is what I get for not paying... man, I love not paying attention to OpenAI launches. This is great. I wish you hadn't told me that, but I'm glad I was oblivious to it up until now.
Nicolay Gerold:I would say, like, it's just that I want standards. And, like, the function calling stuff, in my opinion, it's like I can translate between all of those pretty easily. So when something new is built, just pick what's commonly adopted, and don't force me to first understand your schema so I can write my own middle layer to translate between what I'm generating and what you have.
Dexter Horthy:Yeah. I mean, so not only do I think that, like, yes, I agree, and all of your context should be... I basically have come to see that, for most applications, the state of the application and the context window you're passing to and from the LLM can often be the same thing, even for, like, fairly complex multi-agent kind of flows. And then beyond that, I think the context window doesn't have to be a string of, like, messages in the OpenAI JSON format. I think one of the biggest challenges in making agents is keeping the context window controlled and tight and focused. And I think even for workflows that go to, like, twenty, thirty, forty steps, which is where I think a lot of people start to see agents kind of break down and get lost and lose focus and kind of, like, spin through some, like, weird context stuff, where they're trying the same broken approach over and over again.
Dexter Horthy:You've seen this. Yeah?
Nicolay Gerold:Yeah. Yeah.
Dexter Horthy:So I think in those worlds, like, if you can add an extra ten steps of focus by being way more token-efficient, I think you gain a lot. And one thing I've learned in, like, digging into... I don't know, there are some tools that do, like, token visualization really well. And you can see that, like, JSON is not necessarily the most token-efficient way to manage a context window. I know once the JSON hits the LLM, they're doing a bunch of conversions and tokenizing it into, like, user and system and things like that
Dexter Horthy:Under the hood. But I've found, and this is a little more vibes than benchmarks, that rather than having system, user, tool calls, tool answers, tool calls, tool answers, etcetera, I just do system and then user. And in the user message, I put, like, a concise... I happen to be one of the, like, people who's into doing, like, XML in prompts. But I'll just take everything that's happened, all of the, like, messages and all the tool calls and all the tool answers, and, like, serialize them however you want into one user message.
Dexter Horthy:Because at the end of the day, when you pass in a context window to an agent and say, like, call tools and or tell me you're done, one of those two things, you're basically saying, here's everything that's happened. What's the next step? And you can structure that question however you want.
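A sketch of that serialization, with made-up event shapes and tag names: instead of a growing array of role-tagged messages, everything that has happened is flattened into one user message and the model is asked for the next step.

```python
def render_context(events):
    """Flatten the whole history into one compact, XML-ish block."""
    parts = []
    for e in events:
        if e["type"] == "tool_call":
            parts.append(f'<tool_call name="{e["name"]}">{e["args"]}</tool_call>')
        elif e["type"] == "tool_result":
            parts.append(f'<tool_result name="{e["name"]}">{e["result"]}</tool_result>')
        else:
            parts.append(f'<{e["type"]}>{e["content"]}</{e["type"]}>')
    return "\n".join(parts)

def build_messages(system_prompt, events):
    """Two messages total: system, plus one user message carrying the serialized history."""
    history = render_context(events)
    user = f"{history}\n\nWhat is the next step? Call a tool, or say you are done."
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user},
    ]
```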
Nicolay Gerold:What I am playing around with at the moment is actually because in the end, it's just a stream of tokens. Like Right. The the LLM, it has a notion of, like, this chat model because we trained it into it through, like, the just special tokens. But I think, like, in agent stuff, we are giving it tasks to complete. So Yep.
Nicolay Gerold:I'm thinking more in the direction: I want, like, a task description with a set of requirements, and I'm updating it as the chat goes. And everything that doesn't serve the purpose of actually, like, making this task description better and giving me more context should just be thrown away. So I'm basically updating this task description and the context, you know, with more and more stuff as the conversation goes on. But anything irrelevant, like failed attempts and stuff like that, if it doesn't add any new value, or I'm on, like, the fifth failed attempt, I would just throw it away.
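A rough version of that pruning, with illustrative field names: keep a living task description plus the results that earned their place, and drop failed attempts except maybe the most recent one, so the model sees one warning rather than a pattern to imitate.

```python
def compact_context(task_description, events, keep_last_failure=True):
    """Rebuild a small context: the current task, useful results, and at most one failure."""
    failures = [e for e in events if e.get("status") == "failed"]
    kept = [e for e in events if e.get("status") != "failed" and e.get("useful", True)]
    if keep_last_failure and failures:
        kept.append(failures[-1])   # one failure as a warning, not five as a few-shot pattern
    return {"task": task_description, "history": kept}
```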
Dexter Horthy:So well, yeah. The other thing is like the more you do the same failed attempt, you're almost like few shot programming the model that it's like, oh, after this thing fails, I try it again and it fails again. And it's like training itself in that one context to continue doing the broken thing. Right?
Nicolay Gerold:Yeah.
Dexter Horthy:Do you ever explore, like... when you talk about this, like, are you hand-pruning a context window as it's growing? Like, is that the experience you want? Or are you more like, I want the model to delete the broken parts that it thinks it doesn't need anymore?
Nicolay Gerold:I'm using an LLM. In the end, like, all the, like, LLM compression stuff in agents, the stuff that came out early... I'm not sure, there was, like, one Microsoft library which was used for compressing the context window. And I was always like... usually, you build something task-specific so you can give it information.
Nicolay Gerold:Okay. What information is valuable? What do I want to keep and what do I want to throw away? And the model's job basically is to copy-paste relevant stuff into the context window. I think there should be a smarter way, but it's really hard, because, like, you can't really do it algorithmically.
Dexter Horthy:Yeah. Yeah. It's it's interesting. Yeah. I I think the one of the most interesting things in AI in the next year or so is gonna be continuing to hone in on this, like, which parts of this can humans do well versus which parts, like, can be left to the LLM.
Dexter Horthy:And I think we're having this constant rotating of, like, oh, humans are on the chat side of this and the LLM handles everything, all the calling of tools, and, like, manages its own context and stuff. And it's, like, rotating through a little bit where, like, actually, the LLM's better at writing its own prompts than humans are. And humans should actually... like, there are things about managing the context window that the LLM just isn't very good at.
Nicolay Gerold:Yeah. What I'm always thinking about is, like, because these are generative models, they are supposed to generate. And I think what most people aren't doing is, like, generating a lot of options and then, by your expertise and taste, picking one that's actually decent and that you like. And I think: are we enforcing, like, really deterministic outputs by doing, like, all of the supervised fine-tuning and RLHF, and taking out, like, the creativity that we would want if we generate lots of options, because we want diversity?
Dexter Horthy:Yeah. I mean, have you have you found, like, specific projects where where, like, you've actually, like, baked that options and human choice of of the outputs into what you're doing?
Nicolay Gerold:I have base I have experimented with training my own, like, LLM for writing, and for that, I use base models mostly.
Dexter Horthy:And Okay.
Nicolay Gerold:Just, like, continuing the pretraining with only what I've written before and pieces of writing I like. And it's, like, 90% of the outputs you can throw away, but, like, one in 10 is, like, really, really good and really hits it on the head. And Okay. I think, like, this is something really interesting that I'm a little bit missing and no one is really exploring.
Dexter Horthy:I mean, do you think a model could have picked out of those ten which one was the best, or, like, did it have to be you?
Nicolay Gerold:I think it could be a model, I think you would need to align it and give it like some structure of what you're looking for.
Dexter Horthy:Okay. Yeah. I mean, I think models yeah. Go ahead.
Nicolay Gerold:This is more of a deterministic task. Like, I want to pick the best thing. That's a deterministic task, so fine-tuning and RLHF-ing for that is pretty good. But generating a text, I think, like, I want the creativity. I want, like, very different options, which is really hard to get if you are using the same model and giving the same input with slightly different instructions.
Nicolay Gerold:Like, you're still getting pretty overlapping outputs.
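A sketch of that explore step, assuming an OpenAI-style completions client pointed at a base (non-chat) model; the client setup and model name are placeholders. Sample many continuations at high temperature, then let a human, or a separately aligned judge, do the exploit step of picking one.

```python
from openai import OpenAI

client = OpenAI()  # placeholder; point it at whatever base-model endpoint you use

def generate_options(prompt: str, n: int = 10, temperature: float = 1.0):
    """Explore: sample many diverse continuations from a base model."""
    resp = client.completions.create(
        model="your-base-model",   # hypothetical model name
        prompt=prompt,
        n=n,
        temperature=temperature,
        max_tokens=400,
    )
    return [choice.text for choice in resp.choices]

draft_so_far = "The hardest part of building agents is"
options = generate_options(draft_so_far)
for i, text in enumerate(options):
    print(f"--- option {i} ---\n{text}\n")   # a human picks the one in ten worth keeping
```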
Dexter Horthy:Interesting. What were you fine tune like, were you fine tuning on your own writing? Or, like yeah. Okay.
Nicolay Gerold:And I have, like, a bunch of, like, people I follow whose text I just basically copy-paste in as well. I really like their writing style.
Dexter Horthy:Did you try doing that with Rag, like, just, like, pushing samples into the context window? Like, did that help at all?
Nicolay Gerold:I think with RAG, a lot of times it then gets confused. Like, when I pull in, like, slightly related content, for me it's, like, mostly data and AI stuff, I pull in slightly related stuff and it suddenly adds, like, a few sentences on, like, a different topic which was in the context, you know, and which is completely different to, like, the input I fed it. And I think, like... Jesus. I'm so torn on
Dexter Horthy:It pulls in knowledge when you really wanna coach it on style. Right?
Nicolay Gerold:Yeah. I'm really torn on few shots when it's not about, like, up to date stuff or, like, knowledge based stuff, but rather it's, like, style and, like, forming the output.
Dexter Horthy:The best prompt engineers I know are pretty anti few-shot, because it's like you're giving the model too much direction and you're limiting the scope of what it can do. Again, from what you're saying about generating and being creative.
Nicolay Gerold:Yeah. It's
Dexter Horthy:The thing I've always wanted is, like, I'm curious if you tried this: I want a model that writes like me so badly. And my, like, poor man's version of this is a very, like, RAG-gy version where you just put stuff in the context. But, like, setting up a writing workspace in something like Cursor, taking everything I've ever written that I like the tone and the style of and dropping it in the folder so that Cursor will pull it into the context window, and then just start writing a new article, or even do, like, Cursor autocompletions or command-K to be like, hey, write a paragraph about this.
Dexter Horthy:And then it still needs... right? It still needs to be heavily edited, but it can start autocompleting from what's already in the document. And so in those moments of writing where you're, like, blanking, it can actually fill it in and be like, oh, yeah, that's what I was gonna say.
Dexter Horthy:Or, that's what I wasn't gonna say. Like, having something to say no to, that's wrong, is actually super valuable in writing, to be able to figure out, like, okay, what I actually want is this, versus, I'm stuck, I don't know what to say here.
Nicolay Gerold:Yeah. But it will be difficult to, like, use negative examples in the context as well, like, negative few-shots. That would be interesting. I bet it would work. Yeah. Don't write like this.
Dexter Horthy:I don't know how you would do that. Yeah. Just name it "trash article, do not copy."
Nicolay Gerold:Yeah. I'm I'm using, like, most of my writing is in Cursor because I'm just writing a markdown block basically on ASTRO, so I write directly in the editor, so I don't have the friction of copy and pasting around. And I feel like they have some stuff in the system prompt which primes it for for coding. Yes. So I feel like when I use raw Clot, it's way better at writing because it knows, like, okay.
Nicolay Gerold:I'm writing something and not doing anything else. In, like, cursor, I feel like they have something in there which primes it for coding or writing documentation or something.
Dexter Horthy:I think that's probably right. Have you found any good writing tools that give a Cursor-like experience but for, you know, that raw-Claude kind of writing moment?
Nicolay Gerold:No. But, like, I tried Sudowrite, for example. I tried Lex by Every, lex.page. I tried Spiral, which is basically, like, reformatting content.
Nicolay Gerold:So, like, it assumes you have one piece of content, like a tweet, and you want to go to a blog post, or you have a podcast and you want to go to a LinkedIn post. And they bank really heavily on few-shots, but they are all using the same models, and I feel like it's really limited what they can do with a set of few-shots and without really clear instructions on what I want to get out of it. And in my own writing, I started to do this: basically, I write it out, and I give the same input to an LLM. And when it gets close to what I've written, I don't post it. Because I think, like, it's already on the Internet.
Nicolay Gerold:It's like slop that's already out there. I don't need to add to it. Yeah.
Dexter Horthy:Yeah. The slop thing is really hard. I've been experiencing this thing lately, both with code and new projects and with writing projects, where it's like, you have a really good idea and then you're like, cool, I'm gonna ask Claude to help me build this out. And sometimes it does it and it's pretty good and you can iterate from there and, like, make it good.
Dexter Horthy:And sometimes it comes out so sloppy, just like Claude's having a bad day or GPT's having a bad day, and you have this, like, internal churning. I don't know what the there's there's probably some long German word for this feeling where it's like, that could have been a good idea, but instead, I gave it to a model, and the model didn't do a good job on the first try, and it didn't do a good job on the second try. And now, like, I hate everything about it, and I have to throw it away.
Nicolay Gerold:I think, like, that's the issue I have with LLMs. I want them to first generate options, and then I want them to basically do explore, exploit. Like, in the beginning, explore: give me a bunch of options. Then, when I find something I like, exploit heavily into that.
Nicolay Gerold:But I think, like, directing that is really hard. There are a bunch of people who do, like, this spec-driven prompting: generate a really detailed specification first of what you want to build, and prompt the LLMs to ask, like, a bunch of follow-up questions and then automatically generate, like, prompts for the different components and what you want to build. But also with that, I never really had, like, the success. Because I think, like, if you have built a bunch of stuff, you have, like, so much in your head already, like, how you would build it, what tools you like, what tools you don't like.
Nicolay Gerold:And I think, like, the LLM never gets it correctly.
Dexter Horthy:Like, when you're building systems like that, are you just, like, hacking it together with code, or are you are you are you using some of these, like, more, like, WYSIWYG, like, agent playgrounds?
Nicolay Gerold:To be honest, like, most of my day is just spent in Neovim or Cursor, like one of these two. And I don't really like to exit them because I think, like, staying in the, like, staying in the flow and staying in the same tool really keeps you focused on Yeah. Doing whatever you're doing.
Dexter Horthy:Right. You switch to a window to use ChatGPT, and then suddenly you end up with six LinkedIn tabs open or something. Right?
Nicolay Gerold:Six LinkedIn tabs and 500 Shorts in on YouTube.
Dexter Horthy:Yeah. That's cool. Yeah. I I I mean, I I guess I'm curious just like what are you what are you building these days? Like, what's your what's what have you spent the most time, like, hacking on lately?
Nicolay Gerold:So really hacking hacking. One of those was like this task description and task update.
Dexter Horthy:Mhmm.
Nicolay Gerold:I think a lot is like you can summarize it as, like, context engineering.
Dexter Horthy:That's it. Everything is context engineering. Like, all of rag, all of agents, all of multi agent stuff, it's all, like, how do we get a good focus context to a model so that it can do a good job?
Nicolay Gerold:Yeah. And I'm always thinking, like, you have a bunch of different, like, data structures, or ways to present the information. One is like a profile: you basically have, like, set fields, and you're just updating them all of the time. Then you have, like, a knowledge graph, which is just, like, you're capturing facts and stuff like that. And then you have, like, just raw dumps of data, which is like a vector store, or, like, a SQLite database on top of which you're running some form of search.
Nicolay Gerold:And you can build, I think, like, really much with these three components. But usually, the important part is the metadata. Like, how do I tag the data so I can slice and dice it in a way that creates, like, an interesting representation of whatever I'm talking about.
Dexter Horthy:For the LLM.
Nicolay Gerold:Yeah. Like a timeline, for example. Often it's really interesting to present, like, basically the history of the same file, like, the past commits of the same file, and basically show it how it evolved, and then tell it, okay, I want to do this next.
Dexter Horthy:Interesting. Okay. Because that puts it that's almost like building it's it's velocity rather than position almost.
Nicolay Gerold:Yeah. And I think there are, like, so many different representations which you can craft. And I think, like, we've barely scratched the surface of figuring out how we can best represent the information to get, like, a really good output.
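A sketch of the timeline representation mentioned here: pull the recent commit history of a single file with plain git and hand the model how the file evolved, rather than only its current state. The truncation numbers are arbitrary.

```python
import subprocess

def file_timeline(path: str, max_commits: int = 5, max_chars: int = 8000) -> str:
    """Render how a file evolved: the last few commits with their diffs, oldest first."""
    log = subprocess.run(
        ["git", "log", f"-{max_commits}", "--reverse", "-p", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return log[:max_chars]   # crude truncation; swap in smarter pruning if needed

context = file_timeline("src/pipeline.py")
prompt = f"Here is how this file evolved:\n\n{context}\n\nI want to make this change next. Propose it."
```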
Dexter Horthy:And are you doing any kind of, like, evals or testing or harnesses around all of that? Like, I feel like, as a solo dev hacking on my own stuff on the side a lot, I have the benefit of just being like, yeah, vibes. But I'm curious, because I also have this, like, deep desire to know for sure whether it's working or not. Because, like, there's this thing where you try six different prompts and you kinda get lost a little bit of, like, okay,
Dexter Horthy:which one worked, which one is the best? And you kinda just, like, hope it all gets baked into your subconscious somehow, so that the next time you go prompt, you're, like, getting to know the LLM, but it's at a, you know, deeper layer than anything that you've written down or that you could prove.
Nicolay Gerold:Yeah. I think, like, this is a pet peeve I have. I think there are still no really good tools for, like, data labeling, even open-source ones. I think, like, Argilla, or however it's called, the Hugging Face one, is decent.
Nicolay Gerold:But it always takes, like, a while to actually define the different fields. Like, what do I want as an output? Then coming up with rubrics, like, how do you want to score that shit, is also, like, really difficult. There is, by the way, a really interesting talk at the AI engineering summit by Will something.
Dexter Horthy:Okay.
Nicolay Gerold:He's talking about rubric engineering, which I think was a really interesting term. And I think, like, this could also be really interesting in the future for creating automatic validations, LLM-as-a-judge, and basically steering them in a good direction by really carefully crafting rubrics for whatever output you have.
Dexter Horthy:So the rubric is about scoring the scoring the, like, data triples that you're actually, like, gonna feed into the model as as as fine tuning data?
Nicolay Gerold:Yeah. I would say, like, mostly scoring the output. Like, how good is the output, and basically breaking it down into different rubrics and scoring it along those dimensions on, like, a Likert scale.
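A small sketch of that kind of rubric scoring with an LLM-as-a-judge, using a chat-completions-style client; the rubric dimensions and the judge model are placeholders you would replace with ones crafted for your own task.

```python
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = {
    "relevance": "Does the output address this specific prospect's context?",
    "tone": "Is the tone direct and human, without filler or hype?",
    "call_to_action": "Is there exactly one clear, low-friction ask?",
}

def score_output(output: str) -> dict:
    """Score one output on each rubric dimension, 1 (poor) to 5 (excellent)."""
    prompt = (
        "Score the following output on each rubric dimension from 1 to 5.\n"
        'Return JSON mapping each dimension to {"score": int, "reason": str}.\n\n'
        f"Rubric:\n{json.dumps(RUBRIC, indent=2)}\n\nOutput:\n{output}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # any judge model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```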
Dexter Horthy:That's super interesting.
Nicolay Gerold:Yeah.
Dexter Horthy:We've been thinking a lot about, like, how can we make it really easy and, like, effortless to bring in nontechnical people to help with the data labeling process. I kinda have this thesis of, like, you're building some kind of AI agent or some kind of LLM task, right? And you want it to do something like send cold email outreach, which usually means putting a bunch of slop out in the world. But you're saying, like, here you go:
Dexter Horthy:We're doing one that's actually good. It's gonna do good cold outbound. It's gonna have good context and do this thing. One of the things you can do is, you know, bring in a bunch of salespeople who... I mean, one of the things you can do is outsource it to Scale AI and say, hey, I need a bunch of data that, like, makes it more like this, and, like, have some outsourced thing and get your dataset.
Dexter Horthy:The other thing, that we are, like, still exploring, but I think this works really well, is, you know, put your AI in a Slack channel with four of your best salespeople. Right? I think of the best salesperson at my last startup: if I gave him a spreadsheet of 100 data points and told him to, like, score these, read these hundred emails and tell me on a scale of one to ten, like, why they're good or not, never gonna happen. Literally never gonna happen. I could ask him every day for a year and it would not happen.
Dexter Horthy:But if you change the mental model and you say, hey. Look. We have this AI SDR. It's gonna do your outbound for you, but it's gonna ask permission before it sends every email. It's gonna drop a note in this channel, and it's going to try to do really high quality outreach.
Dexter Horthy:The only thing you have to do is you have to tell it when it does a bad job and why. And the the idea would be like, have people who are nontechnical but have a lot of business context. And I would love to see the sorts of systems where you just drop this thing in Slack and have the salespeople yell at it all day. And every two, three weeks, you generate enough fine tuning data. I mean, you have to change the raw Slack messages of like, hey, this is bad and here's why into fine tuning data.
Dexter Horthy:But I think an LLM can do that, and, like, start to generate a rubric from raw human feedback. It ties into rubric engineering, so there's a couple steps in the pipeline, because you gotta generate the triples and you gotta generate the, like, scores on them. But, I mean, these sorts of systems are really interesting to me: like, real-time re-steering, so that you can make sure that nothing quite, like, atrocious or too sloppy actually goes out, and then, you know, every couple weeks being able to fine-tune or generate evals or whatever it is.
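One shape that pipeline could take, sketched loosely and not as anyone's actual product: a raw "this is bad and here's why" message from the channel gets turned by an LLM into a structured verdict plus a corrected draft, and that becomes a preference pair you can later fine-tune on, eval against, or distill into a rubric. The `llm` callable here is assumed to return parsed JSON.

```python
def feedback_to_training_example(draft_email: str, slack_feedback: str, llm) -> dict:
    """Turn a raw human complaint about a draft into structured preference data."""
    prompt = (
        "A salesperson reviewed this AI-drafted email and left feedback.\n"
        "Return JSON with: verdict ('good' or 'bad'), reasons (list of short strings), "
        "and rewrite (a corrected version of the email).\n\n"
        f"Draft:\n{draft_email}\n\nFeedback:\n{slack_feedback}"
    )
    review = llm(prompt)   # assumed to return the parsed JSON dict
    is_bad = review["verdict"] == "bad"
    return {
        "prompt": draft_email,                       # or the task/context that produced the draft
        "rejected": draft_email if is_bad else None,
        "chosen": review["rewrite"] if is_bad else draft_email,
        "reasons": review["reasons"],                # raw material for rubric generation later
    }
```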
Nicolay Gerold:Yeah. I think, like, the why is probably also, like, a big kicker. Like, a lot of people will get annoyed by actually having to give you a reason why it's shit. Especially, like, after you have done, like, 50 of those, I think the number of people who are still sticking with it really drops.
Nicolay Gerold:I think, like, doing something like, have you thought of that, like, marking what's good and what's bad? Like, I have no clue whether that's possible in Slack, that you have, like, a person just dragging over the text, like, this part is good.
Nicolay Gerold:This is bad. Like, just marking two different spots, and then you basically give, like, a thumbs down on the entire email. And then you have, like, more fine-grained feedback. Like, you have good parts, but you have bad parts. And you can even interpret it:
Nicolay Gerold:like, the email is declined, so the bad part is really, really bad. So you can put, like, a weight on it.
Dexter Horthy:Interesting. I like that. I mean, what it comes down to is, like, I think there is a new generation of really good labeling interfaces that we're gonna need. Whether it's, like, hey, I parsed this PDF and here's all the fields I got out, and a human goes through and says, like, these three are wrong. Or what you just described. Like, we've talked to people about doing transcript labeling.
Dexter Horthy:This is a big thing, and what you just described, I think, is really akin to transcript labeling, where it's, like, mark the parts of the transcript that are wrong and change them to be correct. But yeah, which parts of the email are slop is an interesting one. And, like, I think if you can figure out a way to convince people that, hey, doing this work will limit the amount of slop in the world, you're gonna get a lot of fans, because it's a real problem.
Nicolay Gerold:I'm always wondering, what do you think: with LLMs, especially, like, the popular ones like ChatGPT, are we always converging towards mediocre?
Dexter Horthy:The LLM or or the human race?
Nicolay Gerold:No. The LLM. Because it's trained on feedback data in the chats with, like, the general population.
Dexter Horthy:I don't know. It's a good question. What are you trying to say about the general pod population?
Nicolay Gerold:That it's average. And I think, like, when I would want, like, a writing chatbot, like, I wouldn't want it to be trained on my writing. I want it to be trained on, like, the best technical writers which are out there.
Dexter Horthy:Yeah. Interesting. Or, like, just train it on Ernest Hemingway, and that's it.
Dexter Horthy:Just dump all the works of Hemingway into a fine-tuning set over and over and over again. I mean, I can't see how I can say no to that question. I don't wanna say it out loud because it's depressing, but I think you might be onto something there.
Dexter Horthy:What do we do about that?
Nicolay Gerold:Yeah. I'm not sure what we will do about that, to be honest. I'm really torn on the future. Like, I would really love it if we go toward having a lot of specialized models that are really great at something. I would really love that to happen.
Nicolay Gerold:But I think, like, then most of these models will be stuck within corporate walls, because those companies hold most of the, like, data around certain topics. So most of us won't be able to participate. And most of these models probably won't see daylight or get out into the general population, because they will just be used internally as a competitive edge. Which would also be good, because I hope I would be hired to train the models. Or the second path is, like, the models just get good enough that we are stuck with good enough, and people just get lazy. They don't want to train, like, something additional on top.
Dexter Horthy:Yeah. The bar is, like, we don't need to go from 95 to 97%. Like, it's not worth it. No one cares enough.
Dexter Horthy:It's like, what's another example of something like that in the world? I'm trying to think. Where it's like, yeah, we could make this better, but it's fine. Probably most things in the world. You just get complacent.
Dexter Horthy:Right?
Nicolay Gerold:Yeah. I think, like, the Toyota production system, for me, is, like, the exact opposite. Like, always pushing further, always going for, like, more improvements. But I think, like, just culturally, you see that most companies weren't able to adopt it, even if, like, they tried implementing the Toyota production system, like, Six Sigma or however it's called, lean, whatever.
Dexter Horthy:This is... I mean, this is the same thing. So I organized, like, DevOpsDays and a bunch of, like, you know, the platform engineering stuff. And I think with DevOps and lean and agile and scaled agile, like, all the people for whom that's gonna work have already adopted it, and for everyone else, everyone who has failed, it's, like, too late. Like, that's never gonna happen. You know what I mean?
Dexter Horthy:Like, over the last ten years, the people who were ever gonna figure it out have figured it out, and everyone else is just gonna keep doing what they were doing, and it's good enough.
Nicolay Gerold:Maybe that's a good example. Basically, the pitch is: you will never have, like, the Phoenix Project in your own company. It will never happen. It
Dexter Horthy:will never happen. I mean, you've just tried for ten years, and then you couldn't make it happen. So, I don't know. Maybe after another, like, generation turns over, maybe try again. You know?
Nicolay Gerold:Yeah. It's, like, the same as in science. Like, ideas get adopted one death at a time. It's
Dexter Horthy:Wait. What is that? I've never heard that before.
Nicolay Gerold:Didn't you hear that before? That, basically, the old beliefs have to die out. So the old regime of all the scientists have to die out, and then new ideas can blossom.
Dexter Horthy:Interesting. Okay. I don't know. I did physics in undergrad, and I think we were all working so hard and so depressed that they didn't wanna share that one with us. It might have pushed some people over the line.
Nicolay Gerold:Yeah. By the way, it's I want to talk a little bit more about, like, the read only mode you mentioned last time. I would love to know, like, why do you think, like, the read only mode is really necessary? And do you think we will actually be able to move out of it?
Dexter Horthy:For models? Oh, yeah. Yeah. This idea of, like, most people build agents and then they let the agents do not-that-interesting things.
Dexter Horthy:Right? Like, there's been all this hype about Manus AI. Right? And Manus is, like, the best researcher, and it uses computers, can figure stuff out, and can build you a report and answer any question.
Dexter Horthy:I watched it for thirty seconds and I was like, I get that this is new and cool and different and better, but I'm not excited by it, because I want agents that can actually go do the work for me. I do not wanna be copy-pasting out of some chat interface into the system that actually does the work, whether it's pasting an email out of ChatGPT or whatever it is. And I think a lot of people explore tool use and agents and all this stuff, and they realize, like, oh, the LLM is pretty good, but it's not good enough to get this thing right more than 80 or 90% of the time. And for some actions, like spamming cold outbound emails, 80% is great, because that's already a numbers game.
Dexter Horthy:Right? It's like, if 20% of your outbound sucks, well, 20% of your human-written outbound probably sucks too. You're just trying to get in front of all the right people and get those, you know, needle-in-the-haystack kind of wins. Like, I think 80% accuracy is fine, or even 90% quality, whatever it is. But there's a lot of things that I want agents to do that they cannot do with the 99% or 99.9% accuracy that they need for me to actually trust them to do it, like, fully autonomously.
Dexter Horthy:And I think most people have had that experience of, like, oh, it'd be great if this thing could update my CRM after a call. And then it's like, oh, you know, it kinda gets that wrong sometimes, and it's just creating chaos, and now Salesforce is even worse than it used to be. Like, I would rather have no data than bad data. Right?
Nicolay Gerold:I'm not sure whether you can make Salesforce even worse.
Dexter Horthy:Oh, trust me. I've seen it. But no. So you end up being like, okay, I guess we'll just have it output summaries, or we'll build an oracle that people can ask questions about, hey, what's going on with this customer? But then you start to have, like, drift, right? Where it can query the meeting stuff, you have some data over here, and you have some data in Salesforce, and you, like, lose your source of truth.
Dexter Horthy:And I actually think that, like, we're so close to a breakthrough here. Like, I get why everyone's obsessed with getting models to, like, write code and analyze data and do math. Like, these are really interesting, important benchmarks. But what they're really valuable for today is, like, moving information around and keeping everybody on the same page, which, if you ask any smart, talented, like, knowledge-worker type person, tends to be their least favorite part of the job. Like, most engineers, they, like, hate going into management, because your job becomes, like, keep everybody on the same page, move information around, and keep everybody informed. And so I have this vision of, like, you know, you don't really want an agent to go send team updates all over your Slack for you, because, again, that should be very high quality; if that's slop, you're screwed.
Dexter Horthy:I'm rambling a little bit here. But basically, my take on all of this is, like, the models are almost good enough to do these write actions, to do these impactful things. And if you can get it to 90% accuracy with good prompting and good context... I mean, this is how we started. I started building all these agents for myself that worked in this way, that would, like, shoot me a text message or a Slack message of just, like, hey, I'm gonna do this. What do you think?
Dexter Horthy:And 90% of the time, I would just hit approve. And 10% of the time, I would send it feedback. And that became so much better than, like, asking the LLM and then copy-pasting out. It's basically, like, Cursor tab autocomplete for every API, for every tool call in your life. Just like, hey, this thing happened in the world. The agent looked at it, started thinking about it.
Dexter Horthy:Here's what it thinks the next step is. And nine times out of 10, it's correct. And so like, yes, I still have to do 10% of the work or let's say 11% because I have to do the work to actually hit the approve button. But I just saved I just made myself 10 times more efficient at that. And so now I can do 10 times as much of that work, whatever whatever thing it is, whether it's customer success, or support, or sales, or engineering management, or whatever it is.
Dexter Horthy:So that's my take, maybe the model someday will get smarter. If they converge on mediocrity like you said, maybe not. That's where I think we're going.
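The loop described here reduces to something like the sketch below: the agent proposes a concrete action, a human gets pinged in a channel they already live in, and the action only executes on approval, with rejection text going back into the context as feedback. `notify_human` and `wait_for_response` are stand-ins for whatever transport you use, not a specific SDK.

```python
def propose_and_execute(action_name, args, execute, notify_human, wait_for_response, context):
    """Human-in-the-loop gate: ask first, act on approval, learn from rejection."""
    request_id = notify_human(f"I'm about to run {action_name} with {args}. Approve?")
    reply = wait_for_response(request_id)   # blocks in deterministic code, not in the LLM
    if reply["approved"]:
        result = execute(action_name, args)
        context.append({"type": "tool_result", "name": action_name, "result": result})
    else:
        # The human's reason is the most valuable token in the whole loop.
        context.append({"type": "human_feedback", "name": action_name, "content": reply["comment"]})
    return context
```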
Nicolay Gerold:I I think, like, first, like, converging to mediocrity will be one point. And I think, like, the second trend is, like, we are chaining more and more calls. Like, we are doing task decomposition. And this means, like, even if the model has, like, a 99.9% success rate, if I chain, like, 10 calls, like, the failure rate gets increasingly higher.
Dexter Horthy:It's exponential. Right? Because every time you go a little bit off track, you're more likely to go more off track.
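The arithmetic behind that: per-step reliability compounds multiplicatively over the chain, so even a very reliable step erodes fast across a long workflow.

```python
# Compounded success rate over a chain of steps.
for per_step in (0.999, 0.99, 0.95):
    for steps in (10, 30):
        print(f"{per_step}^{steps} = {per_step ** steps:.3f}")
# 0.999: 0.990 at 10 steps, 0.970 at 30
# 0.99:  0.904 at 10 steps, 0.740 at 30
# 0.95:  0.599 at 10 steps, 0.215 at 30
```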
Nicolay Gerold:So it's like, I'm not sure whether we will hit, like, this 99.9% task completion rate. I don't believe it. So I think, like, the human-in-the-loop part will be really important, and making that as frictionless as possible. And one big belief of mine is, like, actually going into the tools where the people actually, like, spend their time, and just, like, pushing it into that.
Dexter Horthy:Yeah. People don't wanna load up your sixteenth ChatGPT-inspired interface if they can just... I don't know. I'm sitting in my email inbox all day, dispatching work to agents and hearing back from them via email, and it's great, because it almost feels like there are digital people in our company working for us, doing stuff.
Nicolay Gerold:Yeah. We have built something around content automation, which was really interesting to me, and we built it to be autopilot. And, like, what you're building with HumanLayer, we built that, like, basically pushing a message and getting approval. But then we heard, like, the exact opposite: they want an interface where they have, like, a real app, an editor, and stuff like that, which was really annoying to me.
Nicolay Gerold:I really like like, sticking in the back end was really beautiful to me, like, using other stuff as apps.
Dexter Horthy:That's right. Yeah. I someone accused me recently of, like, Dex, you just wanna move all apps to work over email because of LLMs, but also because you've always hated front end dev, and now you can build an app that's entirely back end and the front end is just plain text over email.
Nicolay Gerold:That's just beautiful.
Dexter Horthy:Yeah. Anyways. Where are the people who wanna write everything in plain-text markdown and live in Vim all day and not look at a website?
Nicolay Gerold:Yeah. How do you actually think about doing corrections? Because I think, like, having the interface, you can allow, like, way more corrections in a little bit more fine-grained way. And I think, like, when you're in Slack, WhatsApp, whatever, you're usually limited to, like, messages. So you can't really do, like, just, like, fine-grained correction in a function call.
Dexter Horthy:Yeah. So so Slack builder interface, so you could let someone come in and actually edit the fields of a function call. You could do that, but I think I think you're right. I think the the best way to do this, and I think this is like, again, this is fine tuning for behavior. I don't think this is something that you would use rag for.
Dexter Horthy:Like, I think the data you generate from having humans, like, relabel this stuff is super, super valuable as well. But getting them to do it well is about giving them the right interfaces. So the thing you're talking about, like, highlight "this sucks" and highlight "this is good" for an email or a message or some kind of, like, text content. Or even just being able to edit it, and then we store the diff. Because then you could have the original as, you know, your negative pair and the updated one as your positive pair. I'm trying to think what else.
Dexter Horthy:Yeah. No. I I agree. I think we've talked to people who like, oh, I I want to ping a human for input, but we're building robots. And the input task is actually like, I need to show them a picture and have them draw a bounding box on the picture because the robot is stuck and it doesn't know what to do.
Dexter Horthy:Yeah. So that's another thing we're talking about. I think it'd be really interesting to see, and I think we'll see it this year, whether I do it or not, like, a set of open-source kind of React components to make it really easy to label data. That's kind of one thing that I'd like to see. And if you're listening to this and you wanna collaborate on that, definitely hit me up.
Dexter Horthy:Slack has like a full
Nicolay Gerold:The question is, do you see it as a plug in? Because I I know, like, developers especially, I don't have Slack installed or Discord. I just go in through the browser. And I think, like, using, like, a plugin would make a lot of stuff easier because you could do the labeling and just send it through a plugin, and you wouldn't have to build, like, integrations into all of the different tools.
Dexter Horthy:When you say what do you mean by plug in? Like a cursor add on or something? Or
Nicolay Gerold:No. Basically, it's just a browser extension.
Dexter Horthy:Oh, interesting. Why a browser extension instead of, like, a standalone just, like, web app that uses components?
Nicolay Gerold:Because then he has to go to a new tab.
Dexter Horthy:But the browser extension that shows a pop up on on on Slack? I see.
Nicolay Gerold:Yeah. Just a quick overlay, label it, and be done with it.
Dexter Horthy:Interesting. So from Slack, I can click I can do some action and it actually, like it looks like I'm interacting with something in Slack, but really it's a browser extension. Yeah. Okay.
Dexter Horthy:Well, I hate it.
Nicolay Gerold:Did you ever use the Arc browser?
Dexter Horthy:Yeah. I, have a lot of mixed feelings about ARC. I won't go into it.
Nicolay Gerold:Me too. I think, like, the one thing that they actually was really good was, like, their boosts, which make made it really easy to build custom extend custom extensions.
Dexter Horthy:Okay. I've never built a browser extension. So I'm, yeah, curious to hear how that goes.
Dexter Horthy:Yeah.
Nicolay Gerold:And they basically just gave you like, you had a JS file, you had a CSS file, and you could make, like, an HTML file to render something. And they basically did all the scaffolding around it.
Dexter Horthy:And Okay.
Nicolay Gerold:This made it really easy, and it was really nice to use.
Dexter Horthy:But it only works on ARC?
Nicolay Gerold:Yeah. It only worked on ARC.
Dexter Horthy:Interesting. I'll have to give it a spin. Dude, this has been super fun. I am so glad we got to hang out today.
Nicolay Gerold:Nice. Yep. So if people want to follow along with you or like, give me, like, the ten second pitch of human layer as well. Like, what are you guys building? Who should come to you?
Nicolay Gerold:Like, who needs you?
Dexter Horthy:Oh, yeah. Sure. Totally. So, yeah, HumanLayer is an API and an SDK that lets your AI agents that are kind of running out in the world ping humans asynchronously for input and feedback and approvals on certain things. We have a customer in supply chain that's using HumanLayer so that when their agent notices, like, hey, this order is behind, we need to send an email to the supplier,
Dexter Horthy:it sends an email to the distributor and says, like, hey, look, we're gonna coordinate with suppliers on your behalf. Like, here's the email we're gonna send. Are you okay with that? And they can reply or they can give it feedback. So a lot of the use cases are basically, like, building AI agents that operate in kind of more old-school enterprises, where you need to have mechanisms and levers to build trust.
Dexter Horthy:And so HumanLayer lets them build a lot of customer trust that this agent is gonna do useful things, but not without approval.
Nicolay Gerold:Yeah. And where can people find human layer but also you? Since I had the pleasure of reading through all your tweets.
Dexter Horthy:Did I send you a list of tweets? I'm sorry about that. I shouldn't have done that. HumanLayer: you can log in, sign up. It's self-serve on humanlayer.dev.
Dexter Horthy:Give it a spin. Let us know what you think. There's a Discord. I'm on Twitter, Dex Horthy. I'm on LinkedIn.
Dexter Horthy:Please, if you're interested in this stuff, give me a holler. I respond to pretty much everything. So I'm excited to chat with the community here.
Nicolay Gerold:So this was our first episode back from a one-month hiatus. I've decided to revamp the podcast a bit. This would usually be the place where I share my takeaways from the episode. But I've decided to restructure the podcast a little bit, because the episodes were becoming longer and longer, and the takeaways at the end were, for, like, a one-hour episode, getting, like, ten minutes long. So I have decided to split out the takeaways and publish them as basically a standalone audio.
Nicolay Gerold:And this should be, like, if you are in a rush, if you are in a hurry, or if you're unsure whether the content is relevant for you, this would be the thing for you. So this will be published at the same time. So you you will get the episode and only the takeaways. And how I imagine you listening to it is basically, if you're in a rush, just listen to the takeaways. If you aren't sure whether the content is for you, listen to the takeaways.
Nicolay Gerold:And if you're satisfied, you don't have to listen to the full episode. If it piques your interest, dive deeper into the full episode, so you get, like, the two for one. As a bonus, I will start to publish more and more of the research that goes into each episode. Usually, I spin up quite a few examples, go deep into the different resources on the different tools we are talking about, and develop a GUI, a TUI, a quick example, or spin up a front end. And I will start to share more of my research and my code.
Nicolay Gerold:And this will probably happen mostly on the YouTube channel, because I will probably add quite a few sketches in tldraw or Excalidraw, or share some code. If you want to catch that, subscribe on YouTube. In general, it would help a lot if you leave a review, leave a comment, leave a like, or subscribe on whatever channel you're listening on: Apple Podcasts, Spotify, or YouTube. It helps with the channel. It also helps me get more reach and get bigger and more interesting guests.
Nicolay Gerold:Also, I'm always open for guest suggestions. So if you think you know someone interesting or you're an interesting