Nicolay Gerold: Agentic RAG is one of the most talked about topics at the moment. And the key part is, instead of being a one-way pipeline, agentic RAG basically allows you to check at certain points in the control flow of a search pipeline whether you actually want to go another route or you want to loop. So basically, after the retriever, you could ask: am I actually answering the user question with my generation? And if not, you can run a new retrieval, or you can run a retrieval on a different database you actually have.
You can also use LLMs to basically decide, okay, what do you want to do? Should I actually query? Do I have a question about sales data? Then I might even query a regular database and use SQL. Am I looking for more textual information? Then I might go to the vector database. Is the query rather keyword-based? Then I might opt for BM25.
And agentic RAG, for me at least, means using LLMs in the control flow and at certain points for decision making, but not giving them free rein. And the key is knowing when to stop.
This is what we are talking
about today on how AI is built
as we continue our search series.
And we are talking to Stephen Batifol from Zilliz. Stephen and I discuss agentic RAG and the future of search, where the search system can actually decide what path it wants to take to find the right answers.
And I would love to know: what's your take on agentic RAG? Let me know in the comments, and whether you are actually excited about it, because I'm still a little bit torn. Otherwise, let's do it.
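[A minimal sketch of the routing-plus-stopping idea described above, in Python. The helpers llm_complete, run_sql, vector_search, and bm25_search are hypothetical placeholders, not any specific library or the speakers' actual implementation.]

```python
# Minimal sketch of LLM-based routing between retrieval backends.
# llm_complete, run_sql, vector_search, and bm25_search are hypothetical
# placeholders for whatever LLM client and data stores you actually use.
import json

ROUTER_PROMPT = """Decide how to answer the user query.
Return JSON: {{"route": "sql" | "vector" | "bm25"}}.
- "sql": questions about structured data such as sales numbers.
- "vector": semantic / textual questions.
- "bm25": short keyword-style lookups.
Query: {query}"""

def route_query(query: str, llm_complete) -> str:
    raw = llm_complete(ROUTER_PROMPT.format(query=query))
    return json.loads(raw)["route"]

def answer(query, llm_complete, run_sql, vector_search, bm25_search, max_retries=2):
    for attempt in range(max_retries + 1):
        route = route_query(query, llm_complete)
        if route == "sql":
            results = run_sql(query)
        elif route == "bm25":
            results = bm25_search(query)
        else:
            results = vector_search(query)
        draft = llm_complete(
            f"Answer using only this context:\n{results}\n\nQuestion: {query}"
        )
        # The "knowing when to stop" check: ask the LLM whether the draft
        # actually answers the question; loop at most max_retries times.
        verdict = llm_complete(
            f"Does this answer the question '{query}'? Answer yes or no.\n{draft}"
        )
        if verdict.strip().lower().startswith("yes"):
            return draft
    return "I could not find a reliable answer."
```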
Stephen Batifol: I think what's nice when you have agentic RAG is, the usual RAG is one step, basically. You run it once, and if it can find the data, then it's fine. And if it can't, then it'll be like, yeah, I can't find it. What's nice with agentic is you have a loop. You can tell it, okay, try to find the data, and if you don't have it, then go somewhere else, and then you don't have to run everything in sequence. That's usually my favorite part.
It can also understand the query better. That's a part that I like as well. If you have multiple questions in your query, then you can have it split them up, or have it check whether the answer is correct. And then if it's not, you go back to the beginning or you check again for an answer. That's usually my favorite part.
Nicolay Gerold: Are you actually running the traditional retrieval pipeline in agentic RAG as well, so that you have pre-processing, retrieval, re-ranking, or are you giving it complete flexibility?
Stephen Batifol: No, it depends. Usually it depends on what I want to do, but I like to have it a bit more structured. Basically, I like it when my agent is doing the part that I'm too lazy to do or to define. That's really the way I do it. If my agent could decide either to do a search on the web or a vector search, then I would love that. So usually that's the part I hand over, but otherwise I keep it more structured, because agentic RAG is quite cool, but it's also quite expensive. So it's also the part where I sometimes try to reduce it. If I have to bring the cost down a bit or something, then I'll be like, okay, I'm going to actually do it myself.
Nicolay Gerold: Yeah, and
Stephen Batifol: and latency as well.
Nicolay Gerold: I talked to Doug Turnbull and he is starting a counter movement to RAG, which is GAR. I'm not sure whether you've seen it, the generative one, yeah.
Stephen Batifol: I heard about it. I heard about the, is it, it's not RAG, it's GAR.
Nicolay Gerold: Yeah, basically generation-augmented retrieval. And I think it would fit very well into agentic RAG as well, that you give control to the LLM to adjust the query and especially do the query pre-processing.
Yeah.
Stephen Batifol: No, yeah, I feel you're fully right on that one. I think they're like different parts, but yeah, I quite like GAR. I don't like the name, but I like the generative part of it. I really like, as you said, that the LLM understands the query and then decides to try and break it up, maybe, or try to find better answers. There's HyDE as well, which would also create some different queries if something is not clear, or would actually invent some queries, which is something that I find quite interesting and fascinating.
Nicolay Gerold: What always pops into my mind is, I have the different retrieval strategies, and for different queries, different retrieval strategies might be useful. Have you played around with routing to different retrievers as well, which implement different strategies?
Stephen Batifol: What
do you mean with that?
Nicolay Gerold: So basically I could have one retrieval strategy which does a structured extraction as well. So if I have a query with dates, for example, I'm extracting the dates and using them to filter. But I could also just run HyDE. So basically I'm taking the query, generating a hypothetical document of how it could look, and then just using that for retrieval. So I think there are so many different options, and different queries likely demand different routes.
Stephen Batifol: Yeah, I haven't tried this route. The thing I'm missing a bit at the moment is, I've been building a lot of agentic RAGs and a lot of agents in general, but I don't really take the time to then try all the different retrieval strategies, I would say. So no, I haven't tried this one.
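[A rough sketch of the two alternative strategies Nicolay just described: structured date extraction as a metadata filter versus HyDE. llm_complete and search are assumed placeholder helpers, not a specific library's API.]

```python
# Sketch of two alternative retrieval strategies for the same query.
# llm_complete(prompt) -> str and search(text, filter=None) -> list are
# assumed placeholders for your LLM client and vector store.

def retrieve_with_date_filter(query, llm_complete, search):
    # Strategy 1: structured extraction, pull dates out of the query
    # and push them down as a metadata filter.
    dates = llm_complete(
        f"Extract all years mentioned in this query as a comma-separated list, "
        f"or return NONE:\n{query}"
    )
    flt = None if dates.strip() == "NONE" else f"year in [{dates}]"
    return search(query, filter=flt)

def retrieve_with_hyde(query, llm_complete, search):
    # Strategy 2: HyDE, generate a hypothetical document that would answer
    # the query and search with that text instead of the raw query.
    hypothetical_doc = llm_complete(
        f"Write a short passage that would plausibly answer:\n{query}"
    )
    return search(hypothetical_doc)
```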
Nicolay Gerold: Yeah, because this is probably for me the most interesting part, because it takes so much work to get a search engine to work, but then it's pretty fixed. Once you do a new implementation or you make adjustments to the way it searches, then you can basically bring it in, but not really dynamically based on the query, which would be way more interesting.
Stephen Batifol: Yeah, exactly. I think it's one thing I'm excited about for the future, actually. Yesterday someone asked me, what are you most excited about for 2025? And I was like, it's actually that: agentic workflows that are really based on the query. It's, okay, I'm going here, I'm going there. Which would be more interesting, I would say.
Nicolay Gerold: Yeah.
Can you maybe walk us through the typical workflow from start to end, basically from query to answer, how it would look with agentic RAG?
Stephen Batifol: So what I have now is really, you have your query, then obviously you're going to process it, with different ways of creating chunks, and then you're going to create embeddings. There are different ways of doing it. Recently, what's the name, ColPali, which can create embeddings where they take images, I've seen a lot of success with that one. But just to answer the question: so then you have your embeddings, and you're going to process everything. You're going to store your embeddings directly in a vector database, and then you have your LLM that is going to interact with it. And the good part about agentic RAG is that, instead of being a one-way pass-through, it's going to be like, okay, you have your query, you have your embeddings, you have your user query, and then you're really going to check.
Okay, I'm here, I'm trying to answer the question of this user. Usually my workflows would be checking if the answer is correct, so if we're actually answering the question of the user; I have different workflows my agentic RAG runs for that. And then it's also checking different sources, whether what I'm saying is correct. That obviously depends on the data I have; sometimes it's private data that I can't check. But if it's public data, then I will actually do a check. And then I usually also have a workflow that asks, are you happy with the answer? So it's actually asking the LLM itself: hey, are you happy with the answer, and is it actually answering the question? And that's my typical workflow for agentic RAG.
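[A compressed sketch of the workflow just described, with the three checks as explicit steps. All function names are placeholders for illustration, not a specific framework or Stephen's actual code.]

```python
# Sketch of the agentic RAG workflow described above: retrieve, generate,
# then run explicit checks instead of returning the first draft.
# retrieve, llm_complete, and fact_check_public_sources are placeholders.

def agentic_rag(query, retrieve, llm_complete, fact_check_public_sources=None, max_loops=2):
    draft = "I could not produce an answer."
    for _ in range(max_loops):
        chunks = retrieve(query)
        draft = llm_complete(f"Context:\n{chunks}\n\nQuestion: {query}\nAnswer:")

        # Check 1: does the draft actually answer the user's question?
        on_topic = llm_complete(
            f"Question: {query}\nAnswer: {draft}\n"
            "Does the answer address the question? yes/no"
        ).lower().startswith("yes")

        # Check 2: cross-check against other sources when the data is public.
        grounded = True
        if fact_check_public_sources is not None:
            grounded = fact_check_public_sources(draft)

        # Check 3: ask the LLM whether it is "happy" with its own answer.
        satisfied = llm_complete(
            f"Are you happy with this answer to '{query}'? yes/no\n{draft}"
        ).lower().startswith("yes")

        if on_topic and grounded and satisfied:
            return draft
        # Otherwise rewrite the query and loop once more.
        query = llm_complete(f"Rewrite this query to get better search results:\n{query}")
    return draft  # best effort after the loop budget is spent
```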
Nicolay Gerold: Yeah, and do you actually allow loops as well? Or is it basically just workflow after workflow that's triggered, and there is maybe a branching back to another position?
Stephen Batifol: It's more like branches, yeah. It's more, okay, I've decided this route. For now it's really a lot of ifs, a lot of ifs with text, because I return some kind of text from my workflow as part of my graph. So yeah, that's more what I have at the moment. I've tried to play with ReAct, I don't know if you played with it, and I don't know what your success with it was, but for me, when I did it, I was really impressed at what it could do and how it can reason. But then, a lot of times it would just not do it again. It's really not predictable. I had a 30% success rate on one task, for example, with ReAct.
Nicolay Gerold: So ReAct basically means it's a flow you're going through all the time: thought, action, observation, and you basically try to improve the answers. I've had very limited success. Or rather, I got it to work once, but I didn't get it to work a second time.
Stephen Batifol: Yeah, for me it was really similar. I remember I had a demo, actually, where I was like, okay, I'm going to showcase ReAct, because it's really cool. And what I like is that it's rewriting the query, sometimes like the user would, being like, okay, this was not clear, I'm going to try to redo it. But then, yeah, when I was preparing for the demo, I was like, man, it works one time out of five, and I was like, yeah, I can't have that for my demo. It's too stressful. And then when I talked to other people, it was a similar thing with ReAct for them. It seems like a cool one, but not there yet.
Nicolay Gerold: I tried it, and I've always ended up writing some kind of workflow with a loop and with branching. But I haven't found a good library for that as of yet.
Stephen Batifol: Yeah. It's ReAct, yeah. Let's see if we get an improvement on it. I think it could be really good. It could be very versatile, I would say.
Nicolay Gerold: Yeah, what about the different types of context management? I find this often very interesting, because for cost optimization you quickly degrade into adding a cache, adding a semantic cache, or
Stephen Batifol: Yeah.
Nicolay Gerold: But also trying to compress what you're feeding into the model, the context. Then you also have the message store of the past interactions, and this you have to manage. What are the different components you bring into the agentic RAG, and how do you manage those?
Stephen Batifol: Yeah, this one is a big one usually. I feel like, first, it also depends on the LLM you're going to use. They have different context windows, so that's the big one. But usually how I do it is the first classic: I'm just going to store previous interactions with the agent, all the discussions of the user, and then put that back into the context. Which is nice at first. At first you're happy, everything works, but then at one point it becomes too big, or the LLM is confused, because there can be so many different topics in the discussion. So I usually also try to identify some keywords and topics, and then maybe, sometime.
I haven't tried it yet, but I've read about it: maintaining a sliding window of the different interactions you had. So you have a bigger context and then a smaller one. And then for long-term storage of the context, usually it's more like, okay, I don't keep that in memory, I'm going to store that somewhere in a vector database, and then later on the agent can read it again. That's usually what I've been doing, and also what I've read that other people are doing. And then, once the context window gets really big, I've also sometimes seen people use embeddings: they create embeddings and store them, so you have embeddings of the previous context. Those are the ways I've seen.
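[A sketch of the context-management ideas just described: a sliding window of recent turns plus long-term storage in a vector database. The embed helper and vector_store object are assumed placeholders.]

```python
# Sketch of conversation memory: keep a sliding window of recent turns in the
# prompt and push older turns into a vector store for later recall.
# embed() and vector_store are assumed placeholders.
from collections import deque

class ConversationMemory:
    def __init__(self, vector_store, embed, window_size=10):
        self.window = deque(maxlen=window_size)  # recent turns stay verbatim
        self.vector_store = vector_store         # long-term storage
        self.embed = embed

    def add_turn(self, role, text):
        if len(self.window) == self.window.maxlen:
            # The oldest turn is about to fall out of the window: embed it and
            # store it so the agent can retrieve it again if it becomes relevant.
            old_role, old_text = self.window[0]
            self.vector_store.insert(
                vector=self.embed(old_text),
                metadata={"role": old_role, "text": old_text},
            )
        self.window.append((role, text))

    def build_context(self, query, k=3):
        # Recent turns verbatim plus a few semantically related old turns.
        recalled = self.vector_store.search(self.embed(query), limit=k)
        recent = "\n".join(f"{r}: {t}" for r, t in self.window)
        return f"Relevant past turns:\n{recalled}\n\nRecent conversation:\n{recent}"
```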
Nicolay Gerold: I think the agentic movement is so interesting because, if you take it into production, you end up using every part of NLP: you have factuality checks, you have some classification, you have different ways of compression, you have all kinds of search.
Stephen Batifol: Yes, yeah. For me it really goes back a long time. I was working on NLP a long time ago, then I went to software engineering, machine learning engineering, and now I've come back to it. And it's very funny to see, exactly as you said, all the different components. You see the scores I used to work with, like the BLEU score, for example, for translation and stuff, and I find that very interesting. As you said, you're going to use everything: keyword extraction, topic detection, and the drift as well, context drift, topic drift. You're going to have all of those at least at one point,
Nicolay Gerold: Yeah.
Stephen Batifol: which is interesting.
Nicolay Gerold: Do you think models like, for example, the Jamba models, which combine the transformer and state space models, could be an interesting way? They can actually keep the long-context transformer for all the history, but run more of the interaction through the Mamba part, which has the state compression.
Stephen Batifol: See, I don't have an opinion on this one, because I just saw it quickly. So I won't try to say something and be like, I have an opinion on this one. But in theory, from what I read quickly, it could be. Really, I haven't played with it.
Nicolay Gerold: Yeah. The only thing for me is, I haven't tried it out myself. So I've used it, but only the state space model. And I think that's in the end what we are doing with context compression, but in context compression, we are just throwing it back into a transformer model again.
Stephen Batifol: We're always going back to the transformer, at least for now, mostly.
Nicolay Gerold: Have you done any more large-scale projects where you really have a big vector database or a big document store, which the agent has to use or has access to?
Stephen Batifol: Yeah, I've done it. It's actually usually part of some of my demos as well. When I give a talk and I'm going to talk about something, I always like to have a live demo. So I'm building a demo now with, it's not that big, it's 35 million vectors. It's not gigantic, but my agent is actually checking that one out. I have different ways. First, my agent will try to filter on its own: I have a query, and then I'm asking, say, about Uber data in 2021. So then I have a couple of few-shot examples where I've shown the agent how to do filtering based on some queries. So then it's going to create a filter itself, and then it's going to run the vector search directly, with whatever tool I'm using. That's what I've been trying to do. And another one that I want to do is that I want my agent to then also be able to exploit partitioning.
If possible, depending on the data. In a vector database you can partition your data based on different fields. So if I have data that is like Wikipedia, for example, you have it per language, so you can partition it per language. Then, if you want, you could have your agent going through the database, but only for a specific language. If I'm asking you, I don't know, what's the capital of France, and I'm asking you in English, you probably want to have the answer in English, so you can filter to only English in the first place. So those are the strategies that I've done, basically, for large-scale retrieval.
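[A sketch of scoping an agent's vector search the way Stephen describes: an LLM-generated filter expression plus a language partition, using pymilvus. The collection name, field names, and example prompt are invented for illustration, and the exact search() signature may differ between client versions, so check the pymilvus docs.]

```python
# Sketch: the agent generates a Milvus filter expression from the user query
# and the search is restricted to one partition (e.g. one language).
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

def scoped_search(query_text, query_vector, llm_complete):
    # Few-shot prompt the LLM into producing a Milvus boolean filter expression,
    # e.g. 'company == "Uber" and year == 2021'. llm_complete is a placeholder.
    filter_expr = llm_complete(
        "Turn the question into a Milvus filter over the fields "
        "company (str) and year (int). Example: "
        "'Uber revenue in 2021' -> company == \"Uber\" and year == 2021\n"
        f"Question: {query_text}\nFilter:"
    ).strip()

    return client.search(
        collection_name="financial_reports",
        data=[query_vector],
        filter=filter_expr,          # metadata filter generated by the agent
        partition_names=["english"], # only scan the partition for this language
        limit=10,
        output_fields=["text", "company", "year"],
    )
```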
Nicolay Gerold: Yep. Have you played around with any chunking or metadata strategy which actually made it easier for the agent to retrieve the relevant documents?
Stephen Batifol: Yeah. So at the beginning I was doing smaller chunks, I would say, also because the LLMs were not as capable as now. Now I usually just throw in the whole page. My page is one chunk, and I found it to be way better, if I'm talking about PDFs, for example. And I found it to be better as I'm also integrating some metadata chunks, which are usually very useful. Or, if my document is a webpage, then I'm going to split it more into paragraphs, and then I'm going to have a chunk of the paragraph, but also a summary chunk of the page, or a summary chunk of another section. Usually the way I do it is to have the chunk itself and then a summary of another one somewhere as well, so you can filter through the first one and then go through the second one if needed.
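[A small sketch of that chunking scheme: one chunk per page plus a summary chunk pointing back to it. The summarize helper is an assumed LLM call, not a specific library function.]

```python
# Sketch of page-level chunks plus summary chunks for coarse filtering.
# summarize() is an assumed LLM helper.

def build_chunks(pages, summarize):
    chunks = []
    for page_no, page_text in enumerate(pages):
        chunks.append({
            "type": "page",
            "page": page_no,
            "text": page_text,              # the whole page is one chunk
        })
        chunks.append({
            "type": "summary",
            "page": page_no,
            "text": summarize(page_text),   # short summary chunk for filtering
        })
    return chunks

# At query time you can first search the summary chunks to narrow down the
# candidate pages, then fetch the full page chunks only where needed.
```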
Nicolay Gerold: Yeah. Are you running multiple retrievals and then using those, because you have different representations? So you can run retrieval on the summary, on the source document, but also on the metadata.
Stephen Batifol: Yeah. So, sales plug here: we support multiple vector searches at the same time, up to 10. That's basically what I'm doing, and then I'm using a ranker after. That's usually how I do it: I make a search across up to 10, as I said, and then I'm using my good old re-ranker.
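[A sketch of the multi-vector search plus rank fusion pattern just mentioned, using Milvus hybrid search. The collection and field names are illustrative; the AnnSearchRequest / hybrid_search signatures should be checked against your pymilvus version.]

```python
# Sketch of multi-vector (hybrid) search in Milvus followed by rank fusion.
from pymilvus import connections, Collection, AnnSearchRequest, RRFRanker

connections.connect(uri="http://localhost:19530")
collection = Collection("documents")

def multi_vector_search(body_vec, summary_vec, limit=10):
    body_req = AnnSearchRequest(
        data=[body_vec], anns_field="body_embedding",
        param={"metric_type": "IP"}, limit=50,
    )
    summary_req = AnnSearchRequest(
        data=[summary_vec], anns_field="summary_embedding",
        param={"metric_type": "IP"}, limit=50,
    )
    # Fuse the per-field result lists with reciprocal rank fusion; a
    # cross-encoder re-ranker could then be applied on top of the fused list.
    return collection.hybrid_search(
        reqs=[body_req, summary_req],
        rerank=RRFRanker(),
        limit=limit,
        output_fields=["text"],
    )
```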
Nicolay Gerold: Maybe an even more difficult question, moving more into SQL generation. How do you think structured information, stored in regular databases or DuckDB, will come into play for agents? And if you've played around with that, what do you think is the best representation to give structured information to an agent?
Stephen Batifol: Good question. When I play around with it, it's really a lot of prompt engineering, I feel like. So it's a lot of, okay, getting the agent to understand which one it should pick and when. But so far it's really, okay, you have a query, and I'm just talking about what I tried: you have a query and then you generate a SQL query for the data and get the result. And then if you have another query that is a bit more natural language, I would say, then you go into the unstructured data.
And then what I've played with is: you have two results, and then you ask the LLM to compare them, basically. You're like, okay, which one do you think is better? And if you think this one is better, then you can just return this one. That's the way I've done it so far. But I haven't found, I don't know if you have an opinion on this one, but I haven't found an exact way for structured data, like, I have a result, this is the correct answer you should give. Usually I'm just asking the LLM.
Nicolay Gerold: So what I have had great success with is, especially when the data is a little bit more complex and the data schema is very normalized, the LLM really struggles with the SQL generation because it has to do too many joins. There I had success with basically trying to figure out what could be valuable and creating materialized views for the different representations. I keep those pretty large to include as much of the source data as possible, so the LLM can just take one table and filter it down to what it needs. Through that, you have to do a little bit more work yourself, but you make it easy for the LLM to retrieve the different types of data.
And also, the number of query types you usually hit when taking the data out is limited, because often you have something like growth rate, you have absolute numbers, you have relative numbers, just broad strokes. And you have it from four different organizations or four different products and stuff like that, but there aren't that many different scenarios you would actually expose. So if you can create five or six materialized views, you're finished.
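[A sketch of the "wide, denormalized view" idea for text-to-SQL: pre-join the normalized schema once so the LLM only has to filter a single table. The table and column names are invented for illustration; DuckDB has no MATERIALIZED VIEW, so the view is materialized here with CREATE TABLE AS.]

```python
# Materialize one wide table out of a normalized schema, then expose only
# that table to the text-to-SQL prompt.
import duckdb

con = duckdb.connect("analytics.duckdb")

con.execute("""
    CREATE OR REPLACE TABLE revenue_wide AS
    SELECT o.name AS organization,
           p.name AS product,
           f.year,
           f.revenue,
           f.revenue / LAG(f.revenue) OVER (
               PARTITION BY o.id, p.id ORDER BY f.year
           ) - 1 AS growth_rate
    FROM facts f
    JOIN organizations o ON o.id = f.org_id
    JOIN products p      ON p.id = f.product_id
""")

# The prompt then only exposes the one wide table, so the generated SQL is a
# simple SELECT ... WHERE instead of a multi-join query.
SCHEMA_FOR_PROMPT = "revenue_wide(organization, product, year, revenue, growth_rate)"
```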
Stephen Batifol: Yeah, I need to have a look at those a bit more. I feel like what I want to work on a bit more is actually the real production workload, so actually having structured data and unstructured data at the same time, 'cause I feel like that's what most companies actually have.
Nicolay Gerold: Yeah. And the unstructured data at the moment is coming into the limelight really hard, because you suddenly can use it. So we start to ignore the structured data, which should be easier to get.
Stephen Batifol: Yeah. We're like, man, everything is unstructured, let's go.
Nicolay Gerold: And I want to know about the retrieval, especially for using the metadata. What are you using for that? Are you using something like structured extraction to basically pull the fields out of the user query, or what are you doing?
Stephen Batifol: Yeah, that's basically what I'm using at the moment. I basically give my LLM some examples of queries, and then I'm like, okay, based on those, think of, I don't remember my prompt exactly, but it's really: think of things that are very important, like entities, or years, or other things that could be used as metadata. And then it's a lot of playing around as well to actually create a filter that works, because for metadata it's going to depend on so many things, so many variables. So I basically limit my metadata filtering to a couple of fields, because otherwise it just becomes too complex to actually generate it.
So then I have a bit of metadata in my collection, and it's really going to be like, okay, I have everything like creation time, last access, the source, and so on. And then it's going to be like, okay, if I'm asking, I don't know, what does The Guardian say about that, then the source is likely The Guardian. We also have filtering with prefix, infix and suffix, so I can play around with those as well. It could be like, source contains Guardian.
Nicolay Gerold: Yeah, can you maybe give a few examples of the different things you can filter with prefix, suffix and infix?
Stephen Batifol: Yeah. With prefix, it's really going to be, okay, you want something to start with a given string. Wait, sorry, I always confuse which one is which, but if you have the percentage sign before the word, then it's, okay, you want something to finish with that string, and everything before it you don't really care about. And then infix is going to be in between. So if you need to have, say, "house" in the middle, which doesn't really work as an example, but with an infix match you could somehow find "house" in the middle. I'm terrible at finding examples live. And then suffix, for me, all the suffixes were company names, because my documents were created in that way. It was always year underscore company name, and that's it. So then I would be like, okay, if you find some kind of company, you put it at the end and you just check what comes before. That's the way I would do it.
Nicolay Gerold: A good example, I think, for suffix is either file extensions, where did I get it from, or legal forms of companies, which is something I have at the moment, like in Germany, you
Stephen Batifol: Oh, GmbH and
Nicolay Gerold: Yeah, and stuff like that, which is a good example of a suffix as well.
Stephen Batifol: Yeah, for those. So that's the way I do it, and I found it so useful. And also having the not-in filter as well, being like, okay, I want data that is actually not in those metadata filters.
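[A few illustrative filter expressions for the prefix, infix, suffix, and not-in matching just discussed, in the style of Milvus boolean expressions. The field names are made up for illustration.]

```python
# Example filter expressions with % wildcards and a not-in exclusion.

prefix_filter = 'file_name like "2023_%"'              # starts with "2023_"
suffix_filter = 'company like "%_GmbH"'                # ends with "_GmbH" (legal form)
infix_filter  = 'source like "%Guardian%"'             # contains "Guardian" anywhere
exclusion     = 'source not in ["Reddit", "Twitter"]'  # drop unwanted sources

# These strings are passed as the filter argument of a search call, e.g.:
# client.search(collection_name="docs", data=[vec], filter=infix_filter, limit=10)
```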
Nicolay Gerold: Yeah.
I'm really interested in like the
breaking mechanisms in the end.
Do you have
Stephen Batifol: The what,
Nicolay Gerold: Have you looked into the breaking mechanisms, as I call them? Do you have stuff built into the workflow where you actually just break the entire workflow, the generation, and say, hey, we can't return results? Like, for example, the query is really ambiguous, I can't answer that, return it to the user. Or the retrieval results are really bad, we can't do anything, so maybe re-retrieve or just break off the entire flow.
Stephen Batifol: Yeah. So with ReAct, for example, when I was trying it, there was some way where you can ask it to stop if you don't know what's happening. You're like, hey, let's take a break, we're going to stop here. Sometimes what I like is that you have some question disambiguation, so really trying to infer what's happening from the context you have around you. Because sometimes it can help: a user says, I want information about it or about that, and then you can be like, what is "that"? So it's trying to infer what's happening around it. If the query is ambiguous or doesn't make any sense, what I've found success with usually is to subdivide the query, so that's the first thing, if we have multiple questions.
And then it's also sometimes actually asking the LLM to ask the user, are you happy with the answer, or something, and based on that you go on a different route. That's also a way I've done it in the past: periodically check if the user is happy or whatever, and if yes, then we can continue, and if no, then you have to ask again and correct everything. I like the idea of asking the user, but it's also a lot of work for the user, and the user doesn't really want that very often. So that's usually one way. And then the last thing I can think of is multi-stage retrieval, so instead of going all in, we do it in multiple stages.
Nicolay Gerold: Yeah, I think
this is one of the most important
things actually to figure out, like
for one LLMs saying, I don't know
Stephen Batifol: Yeah.
Nicolay Gerold: when they don't
have any information on that.
And the second thing is like when to
actually ask, getting it to ask for more
information or interact with the user and
when to actually also break off that flow.
Like when does it have enough?
And this is so hard in text.
Because it's like really hard
to define what is enough.
Stephen Batifol: Yeah, it is. And something I'm quite happy about: I built something, just a simple demo, the financials of Uber and Lyft, the usual, but I built a whole thing around it. And then I asked my LLM, okay, what is the financial data of Uber in 2023? And I only have data for 2021 and 2022. And then it was like, oh, actually I don't have this data at all, so I can't say. And I was like, yes, it didn't try to do something else. So that's one part, but yeah, otherwise I feel like it's still a hard one.
Nicolay Gerold: Yeah. So there's a free business idea for the people out there. I think information intake with customers, patients, or whatever: so much depends on what you have already heard, reacting based on the information the user has already given you, knowing when to ask follow-up questions and when you actually don't want to go deeper because you won't get any more good information. If you can figure that out, it has so many different applications.
Stephen Batifol: Yeah.
Nicolay Gerold: The interesting part with agents: how do you evaluate the success, and also how do you evaluate the retrieval in the agentic workflow?
Stephen Batifol: I think I have two kinds of ways, usually. The first one is some kind of task completion rate: have you actually finished the task? You think you finished it, at least; maybe it's not successful, but at least you finished it. Because sometimes agents will just go on and on, and they just never finish. So that's one part.
Then also, obviously, asking for evaluation from the person using it, so all those thumbs up and down. Obviously not everyone will do the thumbs up and down, but that can help. And then also checking yourself, or checking with another LLM: data where you know the answer, some kind of golden datasets, and then every time you change your agent or something, you check, is it still answering my questions correctly, yes or no? I feel like that can be a good one. And then there's also RAGAS, I guess. I played around very quickly with it, and I don't know if it's really fantastic for agents, but I know for RAG it was really good. It can monitor different things: faithfulness, bad retrieval, bad response. Those were quite useful, I feel like.
Nicolay Gerold: Yeah, I think RAGAS should be really suitable for agents if you have a retrieval component as well, especially because they basically have metrics for all the retrieval stuff, but also for the answer which is coming out in the end.
Stephen Batifol: Exactly. It's not only the usual metrics. So yeah, I feel like it's quite a nice project as well.
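[A sketch of scoring a RAG run with the RAGAS-style metrics just mentioned, covering both the retrieval side (context precision/recall) and the generated answer (faithfulness, answer relevancy). The imports and column names follow older ragas releases and may differ in your version; the example row is invented.]

```python
# Evaluate retrieval and answer quality with RAGAS over a small golden dataset.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

eval_data = Dataset.from_dict({
    "question":     ["What was Uber's revenue in 2021?"],
    "answer":       ["Uber reported about $17.5B in revenue in 2021."],
    "contexts":     [["Uber's 2021 annual report states revenue of $17.5 billion."]],
    "ground_truth": ["Uber's 2021 revenue was $17.5 billion."],
})

scores = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)  # rerun after every agent change to catch regressions
```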
Nicolay Gerold: Yeah. As soon as you move into workflows, I think, especially with agents and multiple LLMs, the evaluation part becomes a massive pain, because you have so many different steps and you have to evaluate each of those, and then the total system.
Stephen Batifol: Yeah. I don't know what happened here: when the whole ChatGPT thing arrived and everyone moved into AI, everyone was building, everyone is still building, but no one was really checking for evaluation and stuff. I think it's good that we're actually coming back a bit to that. Before, in the ML world, as we call it, the old one, you would always evaluate something before you put it into production. At least I hope so. So yeah, it's good to have some kind of evaluation there as well.
Nicolay Gerold: Yeah. What do you think about the multimodal stuff? I talked to Jo Bergum yesterday, about ColPali especially. How do you think this will impact your agentic RAG workflows?
Stephen Batifol: I think this one in particular, I'm really looking forward to working with it more. I feel like it's very promising. Also, using VLMs in general seems to be very promising for documents that are a bit more complex. When you have some tables, when you want to learn the schema of the document and everything, you don't want to lose all of that. In my opinion, and I know some people in my company have very different opinions, but mine is that it's really going to be a game changer, especially for agents, especially for overly complex documents. And I think we're going to continue with the normal approach if you have, I don't know, HTML with no images and stuff, very simple. But for the rest, it's nice to be able to keep the structure, also because parsers are very bad at that. It's very hard. Don't get me wrong, I'm not blaming them, but parsing a PDF and such correctly can be very hard. So yeah, I'm really hyped about multimodal in the future, and ColPali as well. It has a lot of potential. Yeah,
Nicolay Gerold: Who is smarter: you, in building a parsing pipeline, or the LLM, in actually interpreting the unparsed information?
Stephen Batifol: exactly.
And I feel like, yeah, I
think the LLM will win.
Nicolay Gerold: all the time for sure.
Nice.
Perfect.
What do you think is missing in the space? What would you love to see built, or what have you found actually missing in your stack when you're building these workflows?
Stephen Batifol: I think at the moment, I don't know, I was talking to someone yesterday who works at Intel, and they were talking about standards. It's something a bit stupid, but I come from the MLOps world: when I arrived there were no standards, and by the time I left it was clear what you were doing. I feel that sometimes in my workflow, that's what I miss. I know it's not a very sexy one, but it can be a very interesting one. Instead of playing around with, okay, I need an embedding model, I need that, I need that, a lot of people don't know what they need to do. Having some kind of standards could be nice.
Or, okay, you have complex PDFs, then maybe your best bet is to go multimodal. I think that's a thing I'm looking forward to a bit. I love learning new things all the time, but sometimes you want to be like, okay, this one I know exactly how to do, I don't need to learn yet another thing because something popped off yesterday. I think that's one thing I'm looking forward to a bit.
Nicolay Gerold: I'm not sure
whether you will get that.
I think I
Stephen Batifol: Yeah, I'm not sure, but, maybe I should rephrase: it's not exactly one path for everyone, because everyone is different. It's more like having a rough idea of, okay, this is what should happen, basically.
Nicolay Gerold: I think established best practices, not for what to use, but for how to approach a problem. And I think that is what is established in MLOps already, but what's missing in AI.
Stephen Batifol: Exactly.
That's better words.
That's basically what I wanted to say.
Nicolay Gerold: Yeah.
And more on the technical side: what is the solution that, if it were built, you would adopt in a heartbeat?
Stephen Batifol: I feel like parsing is one. Honestly, parsing all those different documents can be a pain. I know there are parsing solutions, but I still feel like they're struggling with different things, because everyone is still learning about that, and there's so much new research. But one where I'm like, man, this parsing, I'm sure it didn't miss anything, I think that's one part where I'd be really willing to pay for it, for different use cases. Your data will be there, it won't be missed. Then it's more about how you retrieve it and how you search it, but that's a different part. With the parsing problem, it's like we always said, garbage in, garbage out, it's the same. And it's the part I used to not like when I was a data scientist: cleaning the data, making sure everything is cool. I still don't like to do it.
Nicolay Gerold: And now a hard question, because it's overrated, underrated. What do you think in the LLM space or agent space is overrated? What is underrated?
Stephen Batifol: I think underrated is the embedding model. It can make or break it, I think. I have an example from back then: I was demoing something like chatting with the Berlin parliament, so the data is in German, and very formal German. And I had a demo which was using the embeddings of OpenAI, and it couldn't answer my questions at all. It would miss my questions and be like, no, this is not in the documents, I can't answer. And then I used the embeddings of Jina AI, which are trained on English and German, or at least the one that I used, and it actually found my answer, because my answer was written clearly in German, but OpenAI had completely missed it. So I think that's the part that people are slowly going back to: embedding models are actually quite important. You can't just pick any random one.
Overrated, for now, in my opinion, is long context windows. I would say it's overrated, like the two million tokens, apparently, of Google. When Google showed it in their blog posts, it looked fantastic, but then when other people did some tests, it was still like, eh, it's still not there. They miss some things that are in there. I think it's amazing when you do something one-shot: you have, I don't know, a 300-page PDF and you have a quick question, you just want to go through it, that's cool. But in my opinion, for a big company, I think you might be missing something. And I don't know if you've seen it, but NVIDIA released a paper yesterday or two days ago, which is called OP-RAG, which is, I forgot what the OP stands for.
Nicolay Gerold: How do you spell that?
Stephen Batifol: It's O-P, and RAG. Basically you keep the order of your chunks and everything, and they were showing that, with that, it was better than the classic RAG, but also better than long context windows.
Nicolay Gerold: Yeah, I can't find it, I'm only hitting Oprah.
Stephen Batifol: Yeah, no, I'll have to find it. Everyone is always trying new things, and I'm always like... but yeah, I found this one. So yeah, in my opinion, this one is a bit overrated.
Nicolay Gerold: Nice.
Perfect.
And if people want to start
building the stuff we just talked
about, where would you point them?
Stephen Batifol: If they
want to build it, you mean?
Nicolay Gerold: Yeah,
Stephen Batifol: I think the first thing is really, okay, build something you're interested in. I was interested in the Berlin Parliament, so that's what I did. And then, I don't know if everyone would agree, but I would still think that using LlamaIndex or something could be useful if you want to start, because the documentation explains quite clearly what you need to do, I would say. And then, I don't know if you've played with Claude recently, but Claude can also be amazing at explaining things when you start. And I don't mean it in the way of, hey, write the entire code. It's more, if you ask, okay, can you explain to me what it's doing here? You still write the code if you want; my opinion is that if you want to learn, you still have to go through it. But yeah, those can be very useful: asking LLMs, hey, I have this, and it's going to tell you, okay, write this code, but then, hey, have it explain it, please.
Nicolay Gerold: And what's next? What's next for you? What's next for Zilliz and Milvus on the horizon that you can tease?
Stephen Batifol: I know we're going to have full text search, so that's a cool part, going back to that. But I think for us, for me, it's going to be working more with multimodal. In general, like what we talked about before, I feel like it's only going to get better and better, so that's something I want to work more with. And then also, just in general, having agents that are taking over things I don't like doing. If I could have an agent that could call someone, I'd be the happiest man on earth. I hate calling, so I'm actually trying to build something on the side now with real-time voice, actually trying to answer some questions. So yeah, maybe at one point I can showcase it, if it actually works. But yeah, otherwise, multimodal.
Nicolay Gerold: Nice.
And if people want to follow along with you or with Zilliz, where can they do that?
Stephen Batifol: Yes, they can find me on LinkedIn or on Twitter. So it's Stefan24 on LinkedIn and StefanBT on Twitter, or X, I should rather say. A lot of the things I do are published on LinkedIn. Also, I'm based in Berlin, and I organize meetups: unstructured data meetups, monthly, which are about everything around unstructured data.
Nicolay Gerold: So what can we take away when we're building search applications? I think the first thing is: don't fall into the hype cycle. I think agents are at the moment at peak hype in a lot of the LLM search area. And "agents" in a lot of ways has lost a little bit of its meaning, because it's used so freely for different things: for workflows, for the multi-agent setups where the LLM actually plans itself, like the ReAct-based agent, where you basically have multiple steps the agent has to go through; it interacts with the environment, it observes the state, and then it decides what to do next. And over those steps, this moves close to the direction of reinforcement learning, or how reinforcement learning defines an agent, but the learning component is a little bit missing: optimizing a policy which optimizes an objective function isn't really there yet.
And I think this is the main thing missing for me for agentic RAG to actually work: you don't really have a good way of optimizing agents, because you don't have a clear way to do performance analysis or debugging. If the agent is planning freely, and there are four different agents or sub-agents, which are doing different tasks, have different tools, and interact with each other, how do you know where it actually went wrong, or where it started to go wrong? There can be a massive path dependence, and that's why I actually don't like the free-planning agents. For me, where agentic RAG is really interesting is where we interject at some portions of the control flow: we decide, based on, for example, the user query, which database we are querying and which type of search we are using.
And this allows you, based on the context, based on user profiles, to take different paths through your query system. Also, what Stephen mentioned: based on the answer you generated in the end, or the same based on the results you feed in, like the top 10 results, you can actually decide, okay, am I serving the information need of the user, and try to give the LLM as much context about the user as possible. And if not, you can basically try a different query, rephrase the query, do query expansion, whatever.
And this will add to the latency of the query, but if you're not time constrained, if you're, for example, not in an e-commerce application, but the quality of the results matters more, I think this is something that could be very interesting. Especially if it's very critical data, or results you're making decisions based on, where you actually should spend the time: hey, what are actually the relevant results, and what am I trying to answer here?
And I think asking the LLM, am I serving an information need, is probably pretty tricky, because it can be argued in many different ways that a document might be relevant. I would probably turn it around and do more of an error analysis: what's wrong with the different documents given the query? And use that as feedback to maybe improve the query, take different query paths, or expand the query in certain ways or restrict it. And through that, you probably also get really valuable feedback data for your search system, which you can use to improve it over time.
And yeah, for agentic RAG, I've implemented it twice now, but you basically use it for certain points in decision making. Also an interesting project on that is from Mixedbread, their BMX project, where they use LLMs for query expansion, basically. So they generate alternative queries in the beginning, which for me is also a form that would fall into this agentic RAG definition. But the most used approach is more the dynamic routing: hey, where am I going, which database system, and how should that query be used in the end? And yeah, that's basically it.
I would love to get your opinions on agentic RAG, and whether you've implemented something like that before and how successful you were. Did you actually see quantitative improvements in your search metrics, but also in the feedback you get from the users?
And this, I think, is a little bit missing in the RAG space in general. It's not as metric-driven as traditional search. We are getting there, especially with RAGAS, for example, we are getting more and more metric-driven, but it's also a little bit trickier in my opinion, because we are not just delivering results, we are delivering answers, so we have to evaluate multiple components at the same time.
And yeah, let me know what you think. Also, if you like the episode, leave a like if you're on YouTube, leave a review on Spotify. It helps a lot. Otherwise, I will catch you next week. So talk to you soon.