Speaker 2 (00:10.925)
Welcome back to How AI Is Built. This is Nicolay, and today I'm talking to Vaibhav from Boundary.
Speaker 2 (00:20.344)
We strip away the magic. By the way, my vibe on BAML, what I love: it's a perfect tool. It does one thing crazy well, gives it to you, and it doesn't enforce so much on you in terms of how you have to build stuff. A lot of the tools being built at the moment really try to enforce a certain way of working or building. Whereas in BAML, it's a file and then it's a client, and that's it.
Yeah, I think when we first started this idea, I was looking at a lot of these frameworks and thinking: how could you possibly be making design decisions on code architecture when you don't know the applications and how they'll be built? To me, it's like trying to build ShadCN in the early days of React. You can't do it yet. It was impossible to build ShadCN in those days.
You have to build React first, let React play out for a while, see all the frontend frameworks that come out of it, and then you can build ShadCN. You have to let it play out.
Speaker 2 (01:44.403)
Yeah, and also first try to figure out the best practices and the building blocks and the components. The space is evolving so fast, we don't even know what the building blocks are.
Yeah, exactly.
I'd love to know why you decided to go with a DSL. What was the decision-making process, and what did you try before arriving at the DSL?
Okay, I'll show something kind of funny. Should we do brief intros really fast? Yeah, let's do that.
Speaker 1 (02:23.64)
You do one.
I guess. That's true. So I'm Vaibhav. I'm one of the co-founders over at Boundary, and we make BAML, which is a new programming language for building AI pipelines. And that leads exactly to what you're asking me, which is: why did we make a new programming language? Why do we have a DSL? There are really just two reasons for it, but I think the biggest reason is
You can go ahead; I don't have to introduce myself.
Speaker 1 (02:56.246)
The way you write AI pipelines is fundamentally different from any other piece of software most developers have ever written in their lives. These systems are no longer reliable. They fail all the time, in ways you can't predict, and yet we need them to plug into every existing piece of software. And that's just a different way of thinking. I have this funny thing that I was showing someone. Let's see if I have it up.
I thought I had this up.
Speaker 1 (03:30.286)
Here, I'll show you the way I put it. Whenever we write regular software, none of us accept failure rates like this. If our database failed 5% of the time, that database is dead. It won't even get adopted, right? But then you change the word from API call to LLM, and all of a sudden our expectations become: holy cow, this is great. And in reality, that's not great. The way I put it:
This is never going to work. And this fundamental difference, from panic to amazement, the fact that people feel this way today, is just a matter of the technology being really, really new. But in reality, for us to actually get to three nines of accuracy with these AI pipelines, we need to write different types of systems. There's one company in our neighborhood in Seattle that has done a really, really good job of building fault-tolerant systems; it's named after a rainforest. They've built really, really good systems that allow us to watch Netflix and all the other things that have been built on top of them. But the fact of the matter is, they have a lot of engineers working on making their systems completely fault tolerant: EC2, S3, these core primitives. Now every application developer has to make their application fault tolerant, and that's just a different design pattern. When I build an app on top of GitHub, I don't really think GitHub is going to go down. I don't even design for that. If it goes down, I just tweet, sorry, GitHub's down, and everyone understands. They're okay with it. But now every single part of my application can fail, and that's just a different programming paradigm. And whenever we have new paradigms, we often have new languages. That was the incentive. And the second reason
is that, for some reason, everyone is so obsessed with Python for building AI pipelines. It's not that Python is bad, but imagine walking into a senior system design interview, asking the candidate, hey, design me a system that calls this API in the backend, and they say: first, let's make a Python microservice. That's just wrong. If I have a Java app and I have to call an API endpoint, and the way to do that is to make a Python microservice, that is wrong.
Speaker 1 (05:56.812)
By definition. It's not that you can't make a Python microservice; there are merits to that. But being forced to make a Python microservice when your application is built in Java is just incorrect architecturally. So we said: if we want every piece of software in the world to be able to use AI, then something needs to be an adapter between the AI and every piece of software. And either every single language is going to have frameworks that are all slightly different and nuanced, or we can do something stupid and make a programming language that plugs into every other language in the world with the exact same interfaces. And that's what we did.
Yeah. And that's one of the things I love. I love using BAML with Go, because Go doesn't have any good libraries yet for interfacing with LLMs. And for me, Go is actually perfect for it: it's mostly IO, and it's mostly running in the cloud. Go is the perfect language for LLMs, in my opinion.
I'm biased, I love Rust, so no offense to the people writing Go. But I've recently been learning...
The Rust programming language.
Speaker 1 (07:15.886)
I've recently been learning about Go, because a lot of our community has been using Go. So we're actually building a native Go SDK. Before, we supported Go via OpenAPI, which you've been using, I'm sure. But OpenAPI is kind of ugly. I can't believe that's a standard that exists; it's just ergonomically not nice for writing complicated systems. And I was talking to another Go developer, and one thing they told me is:
Go often doesn't have SDKs for a lot of things, so what I often do is take the OpenAPI spec and just generate a Go client out of it. That's just life as a Go developer. And now we have a native SDK in BAML coming out, with no OpenAPI needed; the codegen will be much prettier. I'm really excited about that, actually. The channel system in Go is really pretty.
Yeah, I'm not sure whether it will land with Go developers, because the gophers just use the standard library for everything and nothing else.
What I'm really curious about: in BAML, I think prompts are really seen as part of the code. And I'm really torn between seeing prompts as code and seeing prompts as artifacts, like models that are tested and then promoted into production.
I've built a lot of experiment deployment pipelines in my past career, before I did this. The fact of the matter is, at some point you need to put things into commits. Even if you're running an experiment, like, I want to run 2% of my traffic through this experiment and 5% through that one, at some point that registry has to live somewhere. If you think the best place for it to live is your code base,
Speaker 1 (09:12.174)
then you use code. If you think the best place to put it is a database, use a database: because BAML functions just take parameters, you can pass the thing loaded from the database into the prompt as a parameter. And then you effectively have the same kind of experience you would have had otherwise. It's like you alluded to at the beginning: BAML is not very opinionated about a lot of things, and I think that's our stance here.
What I would do is put it in my code base. But I know a lot of other teams keep experimentation living in a database, so we've tried to remain agnostic and support both paths. Why do you think it's nice in a non-file format?
We have a lot of cases where we allow a lot of customization, like a bunch of projects we're working on at the moment, where there's so much user customization that you have to have some form of database in the backend where all the different configurations are stored and then pulled in. And then I also need some form of promotion: I'm testing a new prompt and then I'm promoting it to the database.
Yes.
Speaker 1 (10:29.934)
Yeah, I think in your use case it makes a lot of sense to have a database-based system. Once you get into really, really dynamic UIs and everything else in that realm, it suddenly gets harder, because now your UI components also have to be tied to the database. Not that that's bad; it's just not the way most code is written for those systems.
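To make that concrete, here is a minimal sketch of the parameter-passing pattern described above, assuming a Python client generated by BAML (the `baml_client` module produced by its codegen); the `SummarizeNotes` function, `instructions` parameter, and `prompt_fragments` table are hypothetical:

```python
# Hedged sketch: load a prompt fragment from a database and pass it into
# a BAML function as an ordinary argument. Only the pattern matters here;
# the function, parameter, and table names are made up.
import sqlite3

from baml_client import b  # generated client; assumes an existing BAML project

def load_fragment(name: str) -> str:
    # The experiment-specific prompt text lives in a database row,
    # not in the .baml file.
    conn = sqlite3.connect("prompts.db")
    row = conn.execute(
        "SELECT body FROM prompt_fragments WHERE name = ?", (name,)
    ).fetchone()
    conn.close()
    return row[0]

# The BAML function would declare `instructions: string` as a parameter,
# so a prompt variant promoted to the database flows straight into the prompt.
result = b.SummarizeNotes(
    transcript="...patient conversation...",
    instructions=load_fragment("summarize_v2"),
)
```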
Yeah, I would say it depends on how long-lived the different prompt components are. We have a lot of runtime aspects; those are just arguments. Then you have the really static parts, like the system prompts, which should live in the code. The fuzzy thing is in the middle, the midterm-ish parts: how long do they live? There I'm really torn about where I want them. And at the moment, I think
Yeah.
Speaker 2 (11:24.672)
my best solution is to throw it in a database and retrieve it.
It also gives you the power to have someone non-technical go edit and make a change without doing a whole deployment, which is worth something.
Yeah. Do you think BAML is also a format I could give to non-experts?
Yeah, we literally have a lot of non-engineers actually editing BAML code in their engineering teams. For everyone that doesn't know, it's basically a weird kind of, well, I won't say weird. It is weird. Everything is weird at first. The first time I saw Python, I thought: this is so weird compared to C++. The first time I saw TypeScript, I thought: what the heck is this async-await keyword? So I guess BAML is also a little weird in that sense. I should start with that,
but the way I always looked at it is.
Speaker 1 (12:20.768)
A lot of AI stuff is really experimental by nature. And people are pretty smart; most people I've met pick up things that are not too hard really, really fast. What BAML gives you, with the VS Code extension and the playground, is that you can just edit a prompt and see it. It's very visual: you write BAML code like regular code, but you get this preview of your prompt, with your inputs highlighted differently from your prompt. You just have to press
play on your test cases to execute your prompt. And I think that is a huge, huge boon for non-technical people learning this.
Yeah. And suddenly you can also, I really love that, because at the moment I think we're just abusing experts for data labeling. In most cases, when they do labels and give feedback, I literally copy and paste what they have written and put it in the prompt. If we can build a format we can expose to them directly so they edit it themselves, it's much more likely we get good results. And also,
if they can iterate on it, same input, change the prompt, get an output, so that they do the iteration work, it will speed up the entire process from testing to production.
Yes. And one thing a lot of people don't think about is how to frame the problem to their experts in a way that's usable for them. We actually have a lot of customers that create a mapping from the terminology their experts know to type systems as we developers know them. I'll give you one really concrete example. Imagine I'm a doctor having a conversation with a patient, and I want to summarize the recording into some notes that I save in my
Speaker 1 (14:14.286)
EMR system in the hospital. When I go do that, me as a doctor, I like my notes all bullet-pointed: I want these three sections, this section is bullet points, this section is a paragraph. A different doctor, same thing, just wants a list: here's the patient's temperature, here's how they were feeling, here's the issue they came in with, in a short sentence or paragraph, but not bullet points. So how do you navigate that world?
Because if you tell the doctors to prompt engineer, it's not going to work. The doctors can't actually prompt engineer and make that work. No matter what they do, they will never use your technology if you make them do that. So imagine instead you just ask the doctor: tell me the section that you want, and tell me if you want it to be a bullet-point list or a short answer. If they say bullet-point list, then I have a data model with a field titled with the section the doctor named,
and I want a string array as the output. If they say short answer, I just want a string as the output. Now we can build a type system, a programmatic data model, that describes what the doctor wants in a way the computer can understand, without the doctor ever articulating it. The doctor is using terminology they understand, bullet-point list versus short answer versus multi-paragraph, and we can build the best type system for them
without them even knowing they're doing it. That's the real answer I have to it. It's what we support with dynamic types, and it has been a big success for a lot of people working in specialized industries.
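As a rough illustration of what that can look like in code: BAML exposes dynamic types through a TypeBuilder in the generated client. The sketch below is hedged; the `Note` class, `SummarizeVisit` function, and section names are invented, and the exact TypeBuilder API may differ between BAML versions:

```python
# Hedged sketch: shape the output schema from the doctor's own vocabulary.
from baml_client import b                      # generated client (assumed)
from baml_client.type_builder import TypeBuilder

# The doctor picks sections and a style in their own terminology:
doctor_sections = [
    ("Vitals", "bullet_points"),    # bullet points -> string[]
    ("Assessment", "short_answer"), # short answer  -> string
]

tb = TypeBuilder()
for title, style in doctor_sections:
    field_type = tb.string().list() if style == "bullet_points" else tb.string()
    # `Note` is assumed to be marked @@dynamic in the .baml schema.
    tb.Note.add_property(title, field_type)

# The model now fills a schema shaped by the doctor's choices, without
# the doctor ever seeing a type system.
note = b.SummarizeVisit(transcript="...", baml_options={"tb": tb})
```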
Yeah. And at the moment, I think one limitation, which we talked about before, is that there isn't a real orchestration framework around BAML yet. How are people currently orchestrating more complex workflows? For example, the medical use case is probably a multi-step workflow: you first extract all the different facts, and then you compose them together however the user prefers.
Speaker 1 (16:27.126)
Yeah, that's a really, really good question. I'll screen-share, because this is where we have a really fundamentally different approach to what we even think an LLM is. I think everyone is thinking of LLMs completely wrong. Everyone somehow thinks the LLM is this magic thing that just answers my questions and gives me the right answer. That's where everyone messes up. And the reason that happens is because of the way we think about LLMs. Sorry, let me move this. The way we think about an LLM is: it's purely a function. That's all it is. It takes in an input and it spits out an output. It's like the calculator in our computer, which takes in two numbers and an operator and spits out a number. It transforms two numbers, with the help of an operator, into a new number that represents something. That's all we're trying to do here. We're trying to build a thing that takes in a string and spits out one of these three strings. Whether I use an LLM for it, a human, or a calculator doesn't matter, as long as I can meet this contract that I've defined. An LLM, to us, is the same thing. The operands are the parameters of the function plus the model you're using: the weights of the model, combined with the prompt you feed into it, should somehow transform the input into the output. And once you start thinking of the system this way, tool calling becomes really simple to think about too. Tool calling isn't transforming into one output; it's saying: I want to transform some input into one of N choices. That's really all tool calling is. I give you an input, and I want you to pick one of these options. And when you think of it this way, writing orchestration becomes really easy, because you just have a function that picks a tool, and based on which tool it is, you write a switch statement to call the right code. That's all the orchestration you need. Everything on top of that is bloat that you can choose to have or not have. It can add value or not; it's really up to you.
Speaker 1 (18:48.206)
Now, if you have parallel tool calling, your function signature changes from returning one tool to returning many tools. We have that concept in programming: it's an array. So it returns an array of any of the N choices, and your orchestration logic is still the same; you're just adding a for loop around it, right? And then an agent is almost exactly the same thing. It's just a function that takes in some state of some kind, doesn't matter what it is, spits out a new action, and then you take the action and add it to the state, over and over again. You just add a while loop, and that's an agent. So do you need an orchestration framework? Yeah, we should probably support while loops and if statements. And maybe if we talk in two months, I can show you something really cool. But you could do this in Python, TypeScript, Ruby. The only thing you really need is a way to turn the LLM into a function. And that's what we do.
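A minimal Python sketch of that framing: tool calling as a function returning one of N choices, parallel tool calls as a list, an agent as a while loop. All names are illustrative, and `pick_actions` is a deterministic stand-in for any LLM-backed function (for example, one generated by BAML):

```python
from dataclasses import dataclass, field

@dataclass
class SearchWeb:
    query: str

@dataclass
class Reply:
    message: str

Action = SearchWeb | Reply  # the "one of N choices" contract

@dataclass
class State:
    history: list[str] = field(default_factory=list)

def pick_actions(state: State) -> list[Action]:
    # Stand-in: a real version would prompt a model and parse its
    # output into one of the N action types.
    if not state.history:
        return [SearchWeb(query="order status for #1234")]
    return [Reply(message="Your order ships tomorrow.")]

def run_agent(state: State) -> str:
    while True:                               # an agent is a while loop
        for action in pick_actions(state):    # parallel tools: a for loop
            match action:                     # orchestration: a switch
                case SearchWeb(query=q):
                    state.history.append(f"searched: {q}")
                case Reply(message=m):
                    return m

print(run_agent(State()))  # -> Your order ships tomorrow.
```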
I think, for me, an orchestration framework is rather something like Metaflow, not what we call orchestration frameworks at the moment. Easily orchestrating different function calls, that for me isn't an orchestration framework. I'm thinking in terms of data orchestration or ML orchestration, where I deploy onto hardware directly and split up my code into different components, which then run as separate services. Because I think that's what is missing: an easy way to actually build pipelines and deploy them.
Speaker 1 (20:25.326)
Yes, I see what you're talking about. And the real reason you want to do that is durability, right? You want to add durability. So I'll toss another wild card out there. The only reason we put these things on totally different networked machines is because we don't have a way to add durability within the same process. So let's think about the primitive of async-await. Every language has now introduced the idea of async-await, because it's nicer when users don't have to do multi-threading; we just let the language do the multi-threading for you. What if the language could do durability for you and you didn't have to think about it? What if there was a keyword that added durability into the execution of the language?
Then you don't need to deploy separately. Like async-await.
I think at a certain amount of usage you still have to split it out. It depends on all the different stuff you're running. If one of the tools is a browser, I would want to deploy that separately, just because the resource requirements are so different.
Yes, exactly,
Speaker 2 (21:42.442)
If I'm running a local model or a custom model, I would have that on a GPU, or running on Modal, so it would run somewhere else. So I think splitting out is not just about durability; it's also about different resource requirements.
Resource requirements make a lot of sense. The interesting thing, though, from an application perspective: somewhere you're going to have application code that calls a bunch of these other microservices, and that service will also need to be durable, right? No one actually puts the model in the same service as their actual backend code, because models are, like you said, a totally different resource requirement in how you host them. But for application code, at least, it's nice when everything doesn't have an extra network hop attached everywhere. Usually what people end up doing with BAML is wrapping it with Prefect or something similar as their deployment system. And what we've been looking at is: how do people do this in other languages? It sounds like a pain if you're not using Python or TypeScript to build these kinds of workflows, because there's just not that much support.
Yeah. Do you know Temporal?
I know Temporal, they're great.
Speaker 2 (22:59.182)
I really like the approach they're taking. For me, it's still overly complicated when you have to implement it. It took me a while to get the hang of Temporal.
It's different. Like I said, it's just a different way of thinking. That's what I was alluding to this whole time: Temporal is a different way of thinking about writing code.
Yeah.
Speaker 2 (23:23.33)
But when you say you add one keyword and you add durability, how does that manifest? Do you have a cache? Because I think the concept of durability means you have something persisted, like the state of the application, somewhere other than where your code is running, so that when the node where your code runs fails, you can be sure you can restart it somewhere else.
Yep. I think if I had tried to describe BAML to you a year ago, before we had anything for you to use, it would have been really hard. For a lot of software concepts out there, it's easier to type and feel your hands around it. So if you message me in about two or three months, we'll have a thing for you to play around with that will have something interesting. And I don't know if it'll be... huh?
We will do another one in two months, then.
But when I think about durability, it's the same as async-await. Why did JavaScript add async-await? Because a lot of web applications have a network dependency, and when you have IO happening, you don't want every developer to write multi-threaded code, because they will do it wrong. They will just do it completely wrong. So you solve that problem by giving them async-await. And Python has a global interpreter lock; no developer is going to handle releasing the lock correctly with the core C primitives for their system. They're just not going to do that. So you give them a single way to do it, so it can be done properly on their behalf. And when I think about it, that's the...
Speaker 1 (25:06.902)
That's how I think about it. If we suddenly need a lot more software to be durable than we did before, we probably want the language to do the heavy lifting, not the developer.
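To make the "durability as a keyword" idea concrete, here is a toy illustration, emphatically not BAML's design: a decorator that checkpoints each step's result so a restarted process skips work it already finished. The names and the JSON-file storage are made up:

```python
import functools, json, os

CHECKPOINT_FILE = "checkpoints.json"

def durable(fn):
    # Toy stand-in for a language-level durability keyword.
    @functools.wraps(fn)
    def wrapper(*args):
        key = f"{fn.__name__}:{json.dumps(args)}"
        done = {}
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE) as f:
                done = json.load(f)
        if key in done:
            return done[key]      # step already ran before the crash
        result = fn(*args)
        done[key] = result        # persist before moving on
        with open(CHECKPOINT_FILE, "w") as f:
            json.dump(done, f)
        return result
    return wrapper

@durable
def extract_facts(transcript: str) -> list:
    return ["fact A", "fact B"]   # imagine an expensive LLM call here

@durable
def compose_note(facts: list) -> str:
    return "; ".join(facts)

# Re-running after a crash replays from checkpoints instead of recomputing.
print(compose_note(extract_facts("...visit recording...")))
```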
Yeah. I would love to know what you think about Python releasing the GIL with 3.14.
I think it's in the right direction. For background, I've actually gone deep into the CPython C code before; I've optimized parts of it in a weird way. And it's old. That code is really, really old, and I think it's due for a rethinking, because the GIL was designed for a different era of software computing.
Right, it was like...
Speaker 1 (26:02.646)
It was designed for an era where we didn't predict this much interactivity between machines.
And most code now is just IO. A lot of code for a lot of people is mostly IO; computation is just not the bottleneck. So you kind of should do that, in my opinion; it's not the wrong thing to do. I don't know their exact implementation and what they did, but directionally, when I first heard of it, my first reaction was: thank God.
Yeah, I just think multi-threading in itself has a lot of foot guns. Async-await is really simple and easy to understand: underneath, it's just a loop we're iterating through. That's why frontend is often a little easier to understand, how everything works, in my opinion.
Speaker 2 (27:02.186)
Once we get into distributed systems, everything becomes complex.
Yeah, no, I agree. We're probably coming from a different world than a lot of other Python developers. You're coming from the Go world, and I clearly come from Rust and a few other systems programming languages. So we think about things very differently from how Python thinks about them. But one of Python's core principles is simplicity, and if simplicity is the goal, who's the best one to manage the multi-threading system? It is 100% the language. There will be fewer bugs, they will do it more correctly, and they can do the extra work. Maybe it's five microseconds of optimization, but it's worth it for them, because everyone gets the benefit. A language, a compiler, can do things that other people can't, because it's worth it for the compiler but not worth it for a single team.
Yeah. I want to pick up on something, because of the way you approach it: basically an LLM call, or everything you have, is just a function call. You take an input, you get an output. LLMs often fail. Especially over the last few weeks, you see it with OpenAI: they're down so much, because the resource requirements are so much higher. So treating it as a function call gives you the right mental model for how to implement reliability. It's like other function calls: fallbacks to heuristics, fallbacks to other models, fallbacks to classifiers, or whatever way you can substitute it.
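A minimal sketch of that mental model: an LLM call behind an ordinary function signature, with fallbacks tried in order. The model names and the keyword heuristic are placeholders (BAML itself also ships retry and fallback policies on clients):

```python
LABELS = ("account_issue", "technical_support", "other")

def classify_with_model(model: str, text: str) -> str:
    # Placeholder for a real LLM API call.
    raise TimeoutError(f"{model} is down")

def keyword_heuristic(text: str) -> str:
    # Last-resort fallback: crude, but always available.
    return "account_issue" if "account" in text.lower() else "other"

def classify(text: str) -> str:
    for model in ("primary-model", "cheaper-model"):
        try:
            label = classify_with_model(model, text)
            if label in LABELS:   # validate the function's contract
                return label
        except Exception:
            continue              # fall through to the next option
    return keyword_heuristic(text)

print(classify("My account email is wrong"))  # -> account_issue
```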
Speaker 1 (28:49.614)
Or show an error message that actually makes sense. I agree. I think it's important to take the magic away a little bit. I love these things. These models are so beautiful, they are magical, and undermining that is bad. But at the same time, when we write code, it's really bad when the thing feels like too much magic, because it becomes impossible to actually wrangle. I can't imagine being an engineer whose hope of getting something to work is: I hope I type the magic three words of English that make my system work. That hope sounds so bad. How do you iterate on hope? You can't iterate on the English language. There's no debugging. I just write some words and maybe the right things come out. That seems pointless. So our theory is just: you want something better than hope.
To iterate.
Yeah, so basically the best practice is: first set up a test data set, set up an evaluation function, then write the BAML.
Exactly. Yeah. And just start playing with the models. Look at designers: who are the best designers we all know? They're people that have designed tens of thousands of components, tens of thousands of different types of components. That's what makes them really good at design. They weren't born genius designers; they just designed a lot of different things. Now imagine every single thing they designed took them one hour:
Speaker 1 (30:25.742)
that automatically becomes 10,000 hours of design put in. If everything takes five hours, that's 50,000 hours of design work you have to put in to become really good. Now imagine the same thing with these models. If it takes you five minutes to try one prompt, that's 500 minutes to try 100 prompts. You are going to give up before you find your best prompt. If it takes you five seconds to try a prompt, that's 500 seconds to try 100 prompts. If you can iterate an order of magnitude faster, you get faster at understanding how these models work. Which means when the newest model comes out, you can adapt your mind to what it does differently from the old model in five minutes instead of 500 minutes. And that matters for success in this new world.
I think that's true for software engineering in general. When you look at a behavior you want to implement: first, figure out how you can tell whether it's working the way you expected. You write a quick test, you implement something, you call it and see whether it works. Through that, you have the feedback mechanism to actually iterate forward. And once it passes,
you just leave it alone and don't touch it.
What I found really interesting in your examples: you have symbol tuning, and to be honest, I had something completely different in mind when I clicked on it. Can you quickly describe symbol tuning, how to use it, and why it actually works?
Speaker 1 (32:04.526)
Yeah. Symbol tuning. How do I describe it in audio form? Okay, let me try. Say we ask an LLM to become a classifier; specifically, we're doing intent classification for a shopping chatbot, an e-commerce chatbot of some kind. One of the categories might be account issue. The things about account issue that I might care about are my name and email being wrong. But I might have a second category related to technical support, where the website is broken. When I just title something account issue, or I title something technical support, it's not actually obvious to the LLM what the difference between those words is, even though to me as a human it's so obvious. So we might add metadata to describe each of those categories. Now we have a name of a category and a description of a category. But the problem is, all those words are impacting exactly what the LLM thinks about that category. When I say account issue, plus the description of an account issue, the LLM is making assumptions about what the word account expands to. So what if I just renamed the category, for the LLM, to be category zero, with the description of what an account issue is; category one, with the description of what a technical issue is. The title category zero has no affiliation; it doesn't bias the model in any way towards some description. So the model focuses all its energy on the description I attached to it and ignores the name. And then, if the model picks category zero, I know that's an account issue. That's the premise of symbol tuning. It's about letting the model focus specifically on the words I really want it to care about, and taking away the words I don't. I'm using symbolic terminology to give yourself more control.
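A minimal sketch of the idea: neutral symbols in the prompt, mapped back to real intents in code. The prompt wording and labels are illustrative:

```python
# Symbol tuning: the keys k0/k1 carry no semantics of their own, so the
# model leans entirely on the descriptions.
CATEGORIES = {
    "k0": ("account_issue", "The user's name, email, or login details are wrong."),
    "k1": ("technical_support", "The website is broken or not loading."),
}

def build_prompt(message: str) -> str:
    lines = [f"{sym}: {desc}" for sym, (_, desc) in CATEGORIES.items()]
    return (
        "Classify the message into exactly one category. "
        "Answer with the category key only.\n"
        + "\n".join(lines)
        + f"\n\nMessage: {message}\nAnswer:"
    )

def parse_answer(raw: str) -> str:
    # Map the symbol the model picked back to the human-readable intent.
    return CATEGORIES[raw.strip().lower()][0]

# llm_answer = call_model(build_prompt("My email is wrong"))  # assumed LLM call
print(parse_answer("k0"))  # -> account_issue
```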
Speaker 2 (34:22.634)
Yeah, so to make it a little easier to imagine: you're basically generating a JSON object with multiple fields and values, and instead of keys that are very descriptive of what they actually are, you're just using something like A, B, C, D, basically a symbol. What's also a bit overlooked is that it makes generation way easier. The likelihood that the LLM fucks up a key like that
Yeah.
Speaker 2 (34:52.78)
is very low, because it's just a single token, whereas account issue is probably already two tokens.
Yeah, at least.
So yeah, there's more room for error there.
Yeah. I read a really interesting post a few weeks back by a linguist who's working with LLMs. She was really going after word choice: word choice matters a lot, because you bias the LLM in a certain direction with it. And this is something I always try to imagine: when I generate something and I want a certain answer, think about what the document on the internet looks like where the right answer appears, and emulate that document. I think the linguistics and word-choice part really fits this direction as well: thinking about the word choices and where you would actually find this answer.
Speaker 1 (35:52.3)
I think in the LLM world, if you can say the same thing you meant to say, but with five fewer words, it's worth it. I want everyone listening to think about this. Imagine both of us were on here saying the word "like" every 30 seconds. You could still understand us, but you would tune us out so fast. You would not want to hear that. And it's the same every single time you tell the LLM something where you could say five words but use 50 tokens to say it. It's tuning you out. I mean, it's not literally tuning you out, because its job is to try to make you happy, but it has to do so much more work just to understand what you're saying. So I think it's about what I call lossless compression. You want your idea to be as losslessly compressed as possible.
Yeah, with prompts, people always tend to add and never remove stuff.
Exactly. I'm like, what are you doing? You know how they say in software engineering, the best code is deleted code? Same with prompts, in my opinion. The best prompts are the ones where you deleted things from your prompt, not added things to it.
Yeah, and the cost of removing any text, whether it's code, your prompts, or prose, is basically zero nowadays, because you can regenerate it so fast. So you should be way more open to just deleting stuff and trying again.
Speaker 1 (37:20.078)
When people's prompts aren't working, what I tell them is: can you just delete your prompt and start from scratch again? Sometimes I see people with a page-plus of prompt, usually from before they use BAML; I don't really see page-long prompts in BAML. But I'll see the entire screen full of prompting, you scroll down, there's still more prompt. And I ask them a very simple question: have you ever read your whole prompt? Most of the time the answer is no. They've skimmed it, but they haven't read it. And I'm like: if you, the developer who really cares about this application, can't bring yourself to read the whole prompt, what makes you think the LLM is able to pay attention to every word you wanted it to? Aspirationally, I know we want it to, but in reality you're probably putting a lot of noise in there that you don't need, because you yourself haven't read it.
Yeah. Just give it to a junior developer and see whether they can follow the instructions.
That is a great test to be honest.
I think the space is really interesting. Where does BAML fit into the entire ecosystem? I'm a big fan of Outlines as well, using it especially for open-source models. I also like Instructor; it's a really easy plugin, and it's fast. How does
Speaker 2 (38:49.218)
BAML compare to those, and where do you see BAML fitting into the entire ecosystem of tools?
Love them, by the way.
I think the best analogy is Next.js. When I first started doing websites, I did not... oh my God. Okay, yes, it's like Next.js. Let me take that back: like Next.js if it didn't have bugs like that. But no, when I first started, I came from the C++ world, super, super low-level programming. I basically made a career out of writing assembly for a decade.
I'm a little weird.
Speaker 1 (39:30.37)
So when I first started doing websites, I was like, nah, I don't need these frameworks, these frameworks are garbage. So I actually made my own version of React from scratch. It had state management and page routing; I somehow came up with all those concepts independently. And when I started my company with my co-founder, he also didn't know frontend engineering, so we were like, what should we use? I told him I had this framework I'd been working on for a long time, and since we didn't really know JavaScript, I said we should use this. He said, no, we're going to use React. It's the right thing. So we learned React, and the first time you learn React, it's really confusing. What the heck is useMemo? What is useState doing? What is useEffect doing? How are these things actually working under the hood? Because for me, I can't just trust the magic. I have to know what's actually going on, to some degree, for it to make sense. And once you've learned those and you go from React to Next.js, you're like: how is this routing magic happening? There's this mental model you have to build of what's happening under the hood before it becomes useful. But once you build that model, there's no faster tool to iterate with. And that's how I look at it. Outlines and Instructor are great if you're only in the Python world: you can pip install something you already know, there's no learning curve, and you kind of just use it. But just like React, spending four hours on a Saturday will make every other Saturday in the future way better. That's kind of where BAML fits in. I'm not going to lie to you; it's going to take you an hour or two. How long did it take you, when you were first playing around with BAML, to get the hang of it?
I replaced one of my production systems with BAML, and I think it took me six, seven hours to get the hang of it and replace the existing one.
Speaker 1 (41:18.926)
How long did it take you to replace your second function?
Speaker 1 (41:25.42)
Exactly, right. And I think that's the difference. It's a new thing; you have to learn a little bit more, and you have to spend that Saturday. That's why I never try to push BAML on anyone. You have to want to do it as a developer. Just like React: you can't force a developer to learn React if they only know plain JavaScript and have never done it before. They have to spend a Saturday on it. So that's the difference. Instructor will never give you the kind of developer tooling we have, because it's tied to Python. You can never use it in Go; you can never use the same thing in TypeScript. Outlines, to me, is an algorithm, and honestly, I think that algorithm doesn't really make sense. It's the same algorithm everyone else uses for generating JSON, where they filter the outputs of the LLM so it can only construct valid JSON. But why the heck are we using JSON as the optimal serialization technique for models? Come on. What the heck? Why? Imagine generating code, and you have to write the code in JSON format for it to be valid: instead of pressing enter, you write backslash-n. There's no amount of money you could put on the line where I could guarantee I'd win if I had to write code that way. And what makes you think an LLM can do that? So my theory around all of this is: it's not that it's bad, it's just that it doesn't make sense architecturally and technologically for that to be the best serialization format. Our innovation with BAML, schema-aligned parsing, allows the model to output almost anything, and then we do error correction after the fact, which just produces better results. What we showed is that BAML with GPT-3.5 beats GPT-4o with structured outputs on every function-calling benchmark. And there's a reason for that: architecturally, JSON is the worst format for a model to produce for a lot of questions. So yes, the others are faster to start with, because you can just pip install them and learn nothing. But I think everything good requires a little bit of learning, because it's good because it's different.
Speaker 2 (43:40.236)
Yeah, I'm really curious what kind of mistakes SAP can... I'm not sure how you pronounce it.
SAP. Schema-aligned parsing is SAP, yeah.
I didn't want to say SAP because of the German company.
I know, I know, yeah. Actually, they use BAML for quite a few things, which is kind of funny.
I had to use them a few times, do integrations into it. I hate it.
Speaker 1 (44:12.558)
They're a giant org.
Yeah, what kind of mistakes can it fix?
I used to think of it as: what kind of mistakes does it fix? But I now have a different framing in my mind that has been helping us orient the algorithm a little better. We ask: what is a way the LLM could print out some content that would still make sense? So let's take a JSON element. Say we have a class where one of the fields is markdown content. The best way to dump out markdown content is not in a JSON string format: you'd have so many escape characters. In BAML, the model can literally dump out markdown content right inside the JSON element, with new lines and no escape characters, and we will fix it. We can handle unescaped quotation marks. We can handle keys without quotes: you can write a key-value pair without quotation marks on the keys and we'll parse it. Or the model might forget that one of the fields is an array, but if you expect an array, we'll convert it to an array for you without you having to think. So we can not only remove characters, we can add characters.
Speaker 1 (45:28.046)
And slowly we're giving you more control, so you can write functions that modify the algorithm in your own way. That's really the direction we're going in: you as a developer can add to the algorithm, not just remove from it.
You let people write macros.
We're going to support macros. Macros is the right way to think about it, yeah.
Nice, that's really interesting. What are the algorithms behind it at the moment?
I mean, it's all open source, so you can look up the BAML repo and go through the code. I think it's probably like 10,000 lines. It took us about eight months to come up with that algorithm: eight months of a lot of coding and seeing millions of outputs from different users. But it's really a recursive system, so it's hard to describe. To be completely honest, I don't even hold the full code in my head anymore from when I first wrote it. This thing is...
Speaker 1 (46:28.782)
one of the gnarliest pieces of code I've ever written, because you have to do so many things: you have to do error correction, and then you have to coerce it. But the best way to think about it is to imagine taking every data model you have and asking: can I convert the thing on the right into the thing on the left? It's a converter. If I want an int and the LLM gave me an int array, can I convert an int array into an int? If I want an int and it gives me a string, can I convert that string into an int somehow? You just think of it like that, and that's the gist of the algorithm. But then it's a death-by-a-thousand-cuts kind of system, and we just handled a thousand cuts.
Yeah. And I'm wondering: which conversion method is then prioritized? Because, for example, if I want a single integer and I get a list, you could average, you could take the max...
Yeah.
Speaker 1 (47:26.978)
Exactly. So we basically have a scoring system built in. It's a weighted system that looks at the coverage of conversions, weights some sort of error metric, and then makes the right decision from that. The best way to see if it works is just to try it. But I don't think we get bugs in the algorithm anymore; people just say it works most of the time.
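A toy sketch of that shape: try candidate conversions, score each by how lossy it is, pick the cheapest. This is not BAML's actual scoring system, just the idea:

```python
def coerce_int(value):
    # (penalty, result) candidates; lower penalty = safer conversion.
    candidates = []
    if isinstance(value, bool):
        return None                                     # refuse bool -> int
    if isinstance(value, int):
        candidates.append((0, value))                   # exact match
    if isinstance(value, str):
        try:
            candidates.append((1, int(value.strip())))  # "42" -> 42
        except ValueError:
            pass
    if isinstance(value, list) and len(value) == 1:
        inner = coerce_int(value[0])                    # [42] -> 42
        if inner is not None:
            candidates.append((inner[0] + 2, inner[1]))
    return min(candidates, default=None)                # cheapest wins, else None

print(coerce_int("42"))   # (1, 42)
print(coerce_int([7]))    # (2, 7)
print(coerce_int("abc"))  # None: in BAML terms, raise an exception
```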
Speaker 2 (47:55.534)
Do you encounter model refusals from time to time, and how are those handled?
Well, anytime we fail to make a conversion, we raise an exception.
Yeah.
So I think that's usually the best way to think about it. Say you use Outlines or something else, a technique like constrained generation, what Outlines offers or what OpenAI offers for structured outputs. And say you have an LLM prompt that's supposed to spit out a resume data model, and your user is dumb and uploads a receipt. You're still going to get a resume out of it, because they force it to output a resume. When you use BAML, the model answers however it wants, and then we try to convert that object into a resume. And because that conversion happens without any LLMs, it's very algorithmic and deterministic, we can say: hey, we can't convert this to a resume, so raise an exception. I think that is the right behavior. I would much rather get an exception than a fake resume.
Speaker 1 (49:04.109)
Give it to me.
Yeah. There have also been a bunch of studies benchmarking structured generation at large, especially Outlines I think, on things like creativity, against just raw-dogging the output. Have you compared that with BAML? I think the study couldn't be replicated anyhow, but have you done any testing?
Yeah. It turns out that because BAML is so much more freeform and doesn't actually require JSON in its output, you can get much, much better results. It's not as good as plain English in text form, but it's pretty damn close, because you can represent plain English in a BAML structure much more easily.
So for fully creative, free-flowing tasks, BAML probably wouldn't be your go-to; it's for when you have something more structured, for example a CV.
Well, even in that scenario, a free-flowing task is just a function that doesn't return a class; it's a function that returns a string. And all our tooling for iterating and testing very quickly in VS Code, writing test cases, is still really useful. Because even in a free-flowing problem, how do you know it's creative enough? The only way is to press the play button and run test, run test, run test, on 50 different tests, over and over, until it meets your criteria.
Speaker 2 (50:10.936)
Yeah.
Speaker 1 (50:31.234)
So even if you're just returning a string, the iteration loop to get to 10,000 design cycles is still faster. The algorithms are nice, but really the benefit of BAML is how fast your iteration loop ends up being, and the fact that you can be confident the model will respond with the thing you want.
Yeah.
Speaker 2 (50:51.786)
Yeah, I love this fuzzy parsing layer of BAML for actually getting something useful out of the model. What are you most excited about at the moment?
We have a new thing coming out in two weeks that I'm really, really excited about. We're always trying to cook up something new that no one else can do. When we first started making BAML, I remember someone saying, nah, you can never make a programming language, no one's going to use it. Now we literally have 700 companies using us in production, and it's wild to me that that actually happened. I have so much respect for all the customers, yourself included, that started using it. It means so much to us. So now we have a new thing we're cooking up, around the idea of durability and a few other concepts in the syntax, and I'm really excited to get people's hands on it. I'm obviously very biased, and I'm spending my career trying to build this thing out, but I think it's a new way of programming, in how we write the code, and it's going to be really fricking cool. I'm really excited about it.
Yeah, this is in general something I'm missing a little bit from a lot of startups. I think you should always have two different roadmaps. One is the public-facing roadmap, which you get through customer feedback; mostly customers have the same problems. And then you have the one where you collect all the feedback, put it together, and use all your subject-matter expertise to synthesize something new.
Yeah. If you keep asking people, they'll tell you they want an AI framework, and you end up with something like LangChain. And no offense to LangChain, but it fucking sucks. It's just not good. But if you ask people, that's what they want. No one would ever tell you they want BAML. No one ever told us that. It's just so out there and so weird, and you need so many brain cycles to sit down and think before you can even start working on something like this. And most people are trying to solve their business problem.
Speaker 1 (52:58.766)
They're not thinking about everything along the way. So when we were thinking about durability, one thing we do on our team is try to think from first principles. There's this premise of framework versus primitive: which do you want? And I think the difference is that frameworks are easy but very, very hard to debug, while primitives require you to do a little bit more work, but debuggability is really, really high. What we lean towards today is that BAML is all about primitives. We want to expose as many primitives as possible that are useful and that will still be useful five years from now. We've never built a feature we've had to retract. When Assistants came out from OpenAI, people were like, can you add Assistants? And me and the team looked at it and went: this API looks like shit, there's no way I'd build a system on this, it just looks architecturally wrong. And I think we were right: clearly OpenAI is sunsetting Assistants and not supporting it anymore, because it was bad API design. But chat completions was a thing we supported from the very beginning, because we thought it was a really, really good primitive. So we're a little slower to add the new things that models release, because we want to see them come out and stabilize a little more, and we want to feel that they're really good primitives. So basically, whenever someone states a problem, we hear the customer problem. Like you mentioned, you wish you could orchestrate workflows. But then we go back and ask: what if we had to redesign it from scratch? What if there were no prior art at all? How would you do this, taking advantage of all the learnings of prior art, but delivering it in a way that is useful and ergonomic at the same time?
Yeah. I think we already covered what's next for BAML. What about for you: what is something you would love to see built in the space that's not BAML? A shout-out to the community: I want this.
Speaker 1 (55:09.634)
Um, let me think.
Speaker 1 (55:14.542)
I mean, I just spend all my time thinking about how people write code. I freaking love code. For me, to be completely honest, the thing I love most is code. I don't know why; for me, it's an art form that expresses the most about how I feel about things. So BAML is just where my brain goes a hundred percent of the day. But I think the biggest thing I'm really excited about is honestly a paradigm shift in the developer mindset. I just want to go have really interesting conversations with developers about different techniques and the stuff they learn. We're all going to learn so much about the way we think just by using these models. When Google Search first happened, I don't think a lot of developers had maximized Google Search in the beginning. But now everyone knows all these keywords; you do site:reddit.com to get a really good result, right? There are these things that are helpful that we've all learned and adapted to along the way. Even the way we ask questions to Google has evolved over the decades. And the thing I'm really excited about is how our brain space is going to change. It's not about the specific idea that comes out of it; the specifics aren't that interesting to me. What's really interesting to me is the paradigms and the mindset shift that come out of it. If everyone is now thinking about durability at the application layer, what does that mean for which jobs are interesting in software? What kinds of new jobs get created? Those are the kinds of companies I'm most interested in: people making huge long bets on that dimension. I want someone thinking about what it means if every piece of software has to be durable. What does it mean if every company is going to have one tenth of the number of employees it has today?
Speaker 1 (57:16.974)
Or maybe they'll have as many employees as they have today, but each will be expected to do a lot more. Does that mean we go towards a more IC world and a less managerial world? Are we going to have people doing AI project management? I don't know. And that's the kind of stuff I'm really interested in building, or seeing built: people making a long bet on human behavior being slightly different.
And I will try every AI tool out there, because for me it's a huge learning. So I've tried, obviously, all the AI editors, but also these interesting AI email optimizers, because, how do I filter my emails? And everything still feels like it's just trying to add one step on top of what already exists.
I don't know, it's like we're still at the beginning of the App Store, and everyone is building a flashlight app.
Yeah, exactly. Exactly. And Tinder is a perfect analogy, because Tinder doesn't work as a website. It only works because of a new interaction paradigm: the swipe. If you don't have the swipe, you end up building OkCupid. So that's what I want to go see: what is newly possible from these paradigms.
What I'm thinking about is that we humans aren't the drivers of actions anymore. I think we will be doing more and more delegation: giving feedback, editing, approving, rejecting, instead of really doing the work ourselves. And I think the interfaces necessary for that are way different. And also, how do we handle this? Because it's just a constant stream of information,
Speaker 2 (59:08.278)
and you're gonna get hammered.
Here's something a lot of people don't think about when they think about approval and rejection workflows, for example. A lot of people are like, oh yeah, if we put a human in the loop, it's going to get better. But let me describe a scenario. Imagine I have an AI pipeline that is 98% accurate, so 2% of outputs fail. The person doing the human approval is going to see 98% yeses and 2% nos. You know what they're going to get muscle-trained to do? Hit yes, yes, yes, yes, yes, because that's what's happening under the hood. They won't actually find the 2% of failures. So to do this, you actually have to build a system where you somehow increase the share of failures the reviewer sees. You can't have 2% failures in the queue; you need 10% or 20% or 40% failures, but you don't want to make your pipeline worse. So now you have to build a sampling system around it. And I want to see someone spend time and energy on that, because it is worth a lot of money: a human-in-the-loop system is pointless when the human is being trained to ignore the work.
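A toy sketch of such a sampling system: oversample suspected failures (here proxied by a low confidence score) so the reviewer sees roughly a 70/30 mix instead of 98/2. The threshold, target share, and confidence signal are all assumptions:

```python
import random

def review_queue(items, target_suspect_share=0.3, threshold=0.7):
    # items: (payload, confidence) pairs; low confidence ~ suspected failure.
    suspects = [it for it in items if it[1] < threshold]
    confident = [it for it in items if it[1] >= threshold]
    # Pad with just enough confident items to hit the target mix.
    n_confident = int(len(suspects) * (1 - target_suspect_share) / target_suspect_share)
    queue = suspects + random.sample(confident, min(n_confident, len(confident)))
    random.shuffle(queue)  # so the reviewer can't pattern-match on position
    return queue

# Synthetic stream mirroring the 98/2 story: ~2% low-confidence items.
items = [
    (f"case-{i}", 0.2 if i % 50 == 0 else 1 - random.random() * 0.05)
    for i in range(1000)
]
queue = review_queue(items)
print(len(queue), sum(1 for _, c in queue if c < 0.7))  # e.g. 66 20
```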
Yeah. And you can look at really advanced DevOps organizations: which changes can be automatically pushed and which can't is exactly this problem. Which actions can the LLM automatically execute, and which do we have to escalate to a human?
Speaker 1 (01:00:29.038)
Exactly. And you want to do the minimum amount possible, because you want to send some good and some bad to the human so you can build a good data set. But you also need to be able to detect the bad really accurately. So I want to see that kind of bet. The one we're just talking about right now, that is cool to me: systematic things that people are making bets on.
Nice. Where can people learn more about BAML?
If you search BAML on Google, number one will be Bank of America Merrill Lynch, and we'll be number two. So that's probably the easiest way to look it up. It'll link you to our GitHub, and our GitHub will take you to a bunch of different places. But the best way, if you really want to ask questions, is to join our Discord. We've got a fairly good growing community now, and if you ever ask a question, our goal is to respond within three hours, any time of day.
So if you message us, we will respond. And let me think what the other best way is. The other best way is to find a fun Saturday when you're just bored and take a stab at it. And if you don't like it, reach out and tell us what you didn't like about it; it genuinely helps us a lot. It just means a lot to us that you even gave it a shot at any point. So if you're bored and you want to hack around with something weird, try it,
and we'd love it.
Speaker 2 (01:02:02.814)
That's it for today. Let's quickly talk about our new deal. I'm committed to bringing you detailed and practical insights about AI engineering, development, and implementation. In return, I have two simple requests. First, hit subscribe and like right now to help me understand what content resonates with you. And if you found value in this podcast, share it with one other developer or tech professional who's working with AI, to help it grow, so we can build this podcast together.