
A Search System That Learns As You Use It (Agentic RAG) | S2 E18


Nicolay Gerold: Agentic RAG is one of the most talked about topics at the moment.

And the key part is, instead of being a one-way pipeline, agentic RAG basically allows you to check, at certain points in the control flow of a search pipeline, whether you actually want to go another route or whether you want to loop.

So basically, after the retriever, you could ask: am I actually answering the user's question with my generation?

And if not, you can run a new retrieval,
or you can run a retrieval on a

different database you actually have.

You can also have LLMs decide: okay, what do I want to do?

Should I actually query?

So do I have a question about sales data?

Then I might even query a
regular database and use SQL.

Or am I actually looking for more textual information?

Then I might go to the vector database.

Is the query rather keyword-based? Then I might opt for BM25.

And agentic RAG, for me at least, means using LLMs in the control flow and at certain points for decision making, but not giving them free rein.

And the key is knowing when to stop.
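As a rough illustration of LLMs in the control flow without free rein, the sketch below routes between three backends, verifies the draft answer, and caps the number of loops. Everything in it (`call_llm` and the three search functions) is a hypothetical stand-in, not any particular framework.

```python
# Minimal sketch, not a real framework: `call_llm` and the three search
# backends are hypothetical stand-ins for your own LLM client and indexes.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def run_sql_search(query: str) -> list:     # structured data, e.g. sales
    raise NotImplementedError

def run_vector_search(query: str) -> list:  # textual / semantic queries
    raise NotImplementedError

def run_bm25_search(query: str) -> list:    # keyword-style queries
    raise NotImplementedError

def route(query: str) -> str:
    # The LLM decides, but only from a fixed menu of routes: no free rein.
    choice = call_llm(
        "Classify this query as exactly one of: sql, vector, keyword.\n"
        f"Query: {query}\nAnswer with a single word."
    ).strip().lower()
    return choice if choice in {"sql", "vector", "keyword"} else "vector"

BACKENDS = {"sql": run_sql_search, "vector": run_vector_search,
            "keyword": run_bm25_search}

def answer(query: str, max_attempts: int = 3) -> str:
    for _ in range(max_attempts):  # the key part: knowing when to stop
        docs = BACKENDS[route(query)](query)
        draft = call_llm(f"Answer {query!r} using only these documents: {docs}")
        verdict = call_llm(
            f"Does this answer the question {query!r}? Reply yes or no.\n{draft}"
        )
        if verdict.strip().lower().startswith("yes"):
            return draft
        # Loop: rewrite the query and retrieve again on the next pass.
        query = call_llm(f"Rewrite this query to retrieve better results: {query}")
    return "No reliable answer found."
```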

This is what we are talking about today on How AI Is Built, as we continue our search series.

And we are talking to Stephen Batifol from Zilliz, and Stephen and I discuss agentic RAG and the future of search, where the search system can actually decide what path it wants to take to find the right answers.

And I would love to know, what's your take on agentic RAG?

Let me know in the comments.

And whether you are
actually excited about it.

Because I'm still a little bit torn.

Otherwise, let's do it.

Stephen Batifol: I think what's nice when you have agentic RAG is: the usual RAG is a one-step thing, basically.

You do it, it runs once.

Then if it can find the data, it's fine.

And if it can't, then it'll be like: yeah, I can't find it.

What's nice with agentic is you have a loop. You can tell him, or tell it: okay, try to find the data, and if you don't have it, then go somewhere else. And then you don't have to run everything in sequence.

That's usually my favorite part.

It can also understand the query better.

That's a part that I like as well.

Usually, if you have multiple questions in your query, you can also have it split them up, or have it check: okay, are you sure, is the answer correct?

And then if it's not, you go back to the beginning, or you check again for an answer.

That's usually my favorite part.

Nicolay Gerold: Are you actually running the traditional retrieval pipeline in agentic RAG as well, so that you do pre-processing, retrieval, re-ranking? Or are you giving it complete flexibility?

Stephen Batifol: No, it depends.

Usually it depends on what I want to do, but I like to have it a bit more structured.

So basically, I like it when my agent is doing the part that I'm too lazy to do or to define.

That's really the way I do it.

Usually I'm like: yeah, that part, if my agent could decide either to do a search on the web or a vector search, then I would love that.

So usually that's the part I hand over, but otherwise it's a bit more structured, because agentic RAG is quite cool, but it's also quite expensive.

So it's also the part where I sometimes try to reduce it. If I have to bring the cost down a bit or something, then I'll be like: okay, I'm going to actually do it myself

Nicolay Gerold: Yeah, and

Stephen Batifol: and latency as well.

Nicolay Gerold: I talked to Doug Turnbull and he is starting a counter-movement to RAG, which is GAR. I'm not sure whether you've seen it, the generation augmented retrieval.

Stephen Batifol: I heard about it.

What is it, it's not RAG, it's GAR.

Nicolay Gerold: Yeah, basically generation augmented retrieval. And I think it would fit very well into agentic RAG as well, that you give control to the LLM to adjust the query and especially do the query pre-processing.

Yeah.

Stephen Batifol: No, yeah.

I feel you're fully right on that one.

I think they're like different parts, but yeah, I quite like GAR. I don't like the name, but I like the generative part of it.

I really like, as you said, that the LLM understands the query and then decides to try and break it up, maybe, or tries to find better answers.

There's HyDE as well, which would also create different queries if something is not clear, or would actually invent some queries, which is something that I find quite interesting and fascinating.

Nicolay Gerold: What always pops into my mind is: I have different retrieval strategies, and for different queries, different retrieval strategies might be useful.

Have you played around with routing to different retrievers as well, which implement different strategies?

Stephen Batifol: What do you mean by that?

Nicolay Gerold: So basically I could have one retrieval strategy which does a structured extraction as well. So if I have a query with dates, for example, I'm extracting the dates and using them to filter. But I could also just run HyDE.

So basically I'm taking the query, generating a hypothetical document of how the answer could look, and then just using that for retrieval.

So I think there are so many different options, and different queries likely demand different routes.
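For reference, HyDE (Hypothetical Document Embeddings) in its simplest form is only a few lines. The sketch below assumes the same kind of hypothetical `call_llm` helper as above, plus an `embed` function and a vector `index`; it is not tied to any specific library.

```python
def hyde_search(query: str, index, top_k: int = 10):
    # Step 1: have the LLM write a passage that *could* answer the query.
    hypothetical = call_llm(
        "Write a short passage that would plausibly answer this question:\n"
        + query
    )
    # Step 2: embed the hypothetical document instead of the raw query,
    # and retrieve real documents that sit near it in embedding space.
    return index.search(embed(hypothetical), top_k=top_k)
```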

Stephen Batifol: Yeah.

Yeah, I haven't tried this route.

The thing I'm missing a bit at the moment is: I've been building a lot of agentic RAG, and a lot of agents in general, but I don't really take the time to then try all the different retrieval strategies, I would say.

So no, I haven't tried this one.

Nicolay Gerold: Yeah, because this is probably, for me, the most interesting part, because it takes so much work to get a search engine to work, but then it's pretty fixed.

Once you do a new implementation, or you adjust the way it searches, you can basically bring it in, but not really dynamically based on the query, which would be way more interesting.

Stephen Batifol: Yeah, exactly.

I think it's one thing I'm excited about for the future, actually.

Yesterday someone asked me: what are you most excited about for 2025?

And I was like, it's actually that, it's agentic workflows that branch based on the query.

It's: okay, I'm going here, I'm going there. Which would be more interesting, I would say.

Nicolay Gerold: Yeah.

Can you maybe walk us through the typical workflow from start to end, basically from query to answer, how it would look with agentic RAG?

Stephen Batifol: So what I have now is really: you have your query, then obviously you're going to process it, with different ways of creating chunks, and then you're going to create embeddings.

There are different ways of doing it. Recently there's, what's the name, ColPali, which can create embeddings where they take images. I've seen a lot of success with that one, but just to answer the question.

So then you have your embeddings and you're going to process everything. You're going to store your embeddings directly in a vector database, and then you have your LLM that is going to interact with it.

And the good part about agentic RAG is that, instead of being a one-way-through, it's going to be like: okay, you have your query, you have your embeddings, you have your user query, and then you're really going to check: okay, I'm here, I'm trying to answer the question of this user.

Usually my workflows would be checking if the answer is correct, so if we're actually answering the question of the user. So I have different workflows my agentic RAG is doing there.

And then it's also checking different sources, whether what I'm saying is correct. That obviously depends on the data I have; sometimes it's private data that I can't check. But if it's public data, then I will actually do a check.

And then I usually also have a workflow that is: are you happy with the answer? So it's actually asking the LLM itself: hey, are you happy with the answer, and is it actually answering the question?

And that's my typical workflow for agentic RAG.
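Sketched out, that chain of checks might look like the snippet below, reusing the same hypothetical `call_llm` and retrieval helpers as earlier. The prompts and the bounded retry are illustrative, not Stephen's actual implementation.

```python
def verified_answer(query: str, retries: int = 2) -> str:
    docs = run_vector_search(query)
    draft = call_llm(f"Answer {query!r} using: {docs}")

    checks = [
        # 1. Are we actually answering the question of the user?
        f"Question: {query}\nAnswer: {draft}\n"
        "Does the answer address the question? Reply yes or no.",
        # 2. Does what we are saying match the sources we retrieved?
        f"Sources: {docs}\nClaim: {draft}\n"
        "Is the claim supported by the sources? Reply yes or no.",
        # 3. Self-check: hey, are you happy with the answer?
        f"Answer: {draft}\nAre you happy with this answer? Reply yes or no.",
    ]
    for check in checks:
        if not call_llm(check).strip().lower().startswith("yes"):
            if retries == 0:
                return "Could not produce a verified answer."
            # A check failed: branch back with a rephrased query.
            return verified_answer(call_llm(f"Rephrase: {query}"), retries - 1)
    return draft
```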

Nicolay Gerold: Yeah, and do you actually allow loops as well? Or is it basically just workflow after workflow that's triggered, with maybe a branching back to another position?

Stephen Batifol: It's more like branches, yeah.

It's more: okay, I've decided this route. So for now it's really a lot of ifs. It's a lot of ifs on text, because I return some kind of text from my workflow, as part of my graph.

So yeah, that's more what I have at the moment.

I've tried to play, sorry, with ReAct, I don't know if you've played with it. And I don't know what your success with it was, but for me, when I did it, I was really impressed at what it could do, like how it can reason.

But then, a lot of times it would just not do it again. It's really not predictable. I had a 30% success rate on one task, for example, with ReAct.

Nicolay Gerold: So ReAct, basically, what it means: it's a flow you're going through all the time, like thought, action, observation, and you basically try to improve the answers.

I have had very limited success. Or rather, I got it to work once, but I didn't get it to work a second time.
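A bare-bones ReAct loop, for the idea only: the prompt format and the two stub tools are invented, and a production version would need much stricter parsing than this.

```python
TOOLS = {
    "vector_search": lambda q: str(run_vector_search(q)),
    "web_search":    lambda q: "stub: web results for " + q,  # hypothetical
}

def react_answer(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # One iteration: a Thought, then either an Action or a final Answer.
        step = call_llm(
            transcript
            + "Continue with either:\n"
              "Thought: ...\nAction: <tool>[<input>]\n"
              "or\nThought: ...\nAnswer: <final answer>\n"
            + f"Available tools: {list(TOOLS)}"
        )
        transcript += step + "\n"
        if "Answer:" in step:
            return step.split("Answer:", 1)[1].strip()
        if "Action:" in step:
            call = step.split("Action:", 1)[1].strip()
            tool, _, arg = call.partition("[")
            fn = TOOLS.get(tool.strip())
            observation = fn(arg.rstrip("]")) if fn else "unknown tool"
            transcript += f"Observation: {observation}\n"  # feed back, loop
    return "Stopped: no final answer within max_steps."
```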

Stephen Batifol: Yeah, for me it was really like that. And I remember I had a demo actually where I was like: okay, I'm going to showcase ReAct, because it's really cool.

And what I like is that it's rewriting the query as well; sometimes it'll tell the user: okay, this was not clear, I'm going to try to redo it.

But then, yeah, when I was preparing for the demo, I was like: man, it works like one time out of five, and I can't have that for my demo. It's too stressful.

And then when I talked to other people, it was a similar thing with ReAct for them. It seems like a cool one, but not there yet.

Nicolay Gerold: I tried it, and I've always ended up writing some kind of workflow with a loop and branching. But I haven't found a good library for that as of yet.

Stephen Batifol: Yeah.

No, yeah.

ReAct is, yeah.

Let's see.

Let's see if we have an improvement on it.

I think it could be really good.

It could be very versatile, I would say.

Nicolay Gerold: Yeah. What about the different types of context management? I find this often very interesting, because for cost optimization you quickly degrade into adding a cache, adding a semantic cache, or

Stephen Batifol: Yeah.

Nicolay Gerold: But also trying to compress what you're feeding into the model, the context. Then you also have the message store of the past interactions, and this you have to manage.

What are the different components you bring into the agentic RAG, and how do you manage those?

Stephen Batifol: Yeah, this one is a big one usually.

I feel like, first, it's also going to depend on the LLM you're going to use. They have different context windows. So that's the big one.

But then usually how I do it: the first classic is, I'm just going to store the previous interactions with the agent, all the discussions of the user, and then put that back into the context.

Which is nice at first. At first you're happy, everything works, but at one point it becomes too big, or the LLM is confused, because there can be so many different topics in the discussion. So I usually also try to identify some keywords and topics, and then maybe, sometimes...

I haven't tried it yet, but I've read about it: maintaining a sliding window of the different interactions you had. So you have a bigger context and then a smaller one.

And then for long-term storage of the context, usually it's more: okay, I don't keep that in memory, I'm going to store it somewhere in a vector database, and then later on the agent can read it again.

That's usually what I've found myself doing, and also what I've read that other people are mostly doing.

And then once the context window gets really big, I've also seen people use embeddings: they create embeddings of the previous context and store those. Those are the ways I've seen.
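A minimal sketch of the sliding-window-plus-long-term-store pattern he describes: the in-memory list here stands in for a real vector database, and `embed` is a stand-in for your embedding model.

```python
from collections import deque

def embed(text: str) -> list:
    raise NotImplementedError("plug in your embedding model here")

def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))

class ConversationMemory:
    """Sliding window of recent turns; evicted turns go to long-term storage."""

    def __init__(self, window_size: int = 10):
        self.window = deque(maxlen=window_size)  # short-term, kept verbatim
        self.long_term = []                      # stand-in for a vector database

    def add(self, role: str, text: str):
        if len(self.window) == self.window.maxlen:
            oldest = self.window[0]              # about to fall out of the window
            self.long_term.append((embed(oldest), oldest))
        self.window.append(f"{role}: {text}")

    def context_for(self, query: str, k: int = 3) -> str:
        # Recent turns verbatim, plus a few relevant older turns by similarity.
        qvec = embed(query)
        scored = sorted(self.long_term, key=lambda it: dot(it[0], qvec),
                        reverse=True)
        recalled = [text for _, text in scored[:k]]
        return "\n".join(recalled + list(self.window))
```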

Nicolay Gerold: I think the agentic movement is so interesting, because if you take it into production, you end up using every part of NLP: you have factuality checks, you have some classification, you have different ways of compression, you have all kinds of search.

Stephen Batifol: Yes.

Yeah.

It's really, for me: I was working on NLP a long time ago. And then I went to software engineering, machine learning engineering, and now I've come back to it.

And it's very funny to see, exactly as you said, all the different components. You see the scores I used to work with, like the BLEU score, for example, for translation and stuff. And I find that very interesting.

As you said, you're going to use everything: keyword extraction, topic detection, and the drift as well, context drift, topic drift. You're going to have all of those at least at one point,

Nicolay Gerold: Yeah.

Stephen Batifol: which is interesting.

Nicolay Gerold: Do you think models like, for example, the Jamba models, which combine the transformer and state space models, could be an interesting way? They could actually use the long-context transformer for all the history, but run more of the interactions through the Mamba part, which has the state compression.

Stephen Batifol: See, I don't have an opinion on this one, because I've only seen it quickly. So I won't try to say something and pretend I have an opinion on it.

In theory, from what I read quickly, it could be, but really, I haven't played with it.

Nicolay Gerold: Yeah.

The only thing for me: I haven't tried it out myself. I've used it, but only the state space model.

And I think that's in the end what we are doing with context compression, but in context compression, we are just throwing it back into a transformer model again.

Stephen Batifol: Always going back to the transformer, at least mostly for now.

Nicolay Gerold: Have you done any larger-scale projects where you really have a big vector database or a big document store which the agent has to use or has access to?

Stephen Batifol: Yeah, I've done it. It's actually usually part of some of my demos as well.

When I give a talk and I'm going to talk about something, I always like to have a live demo. So I'm building a demo now with, it's not that big, it's 35 million vectors. So it's not gigantic, but my agent is actually checking that one out.

So I have different ways. First, my agent will try to do the filtering itself. I have a query, and I'm asking, say, about Uber data in 2021.

So then I have a few-shot prompt where I've shown the agent how to do filtering based on some queries. Then it's going to create a filter expression itself, and it's going to run the vector search directly.

So that works directly with whatever tool I'm using. But that's what I've been trying to do.

And another one that I want to do is that I want my agent to also be able to exploit partitioning. In a vector database, you can partition your data on different fields.

So if I have data that is like Wikipedia, for example, you have it per language, so you can partition it per language. Then, if you want, you could have your agent going through the database, but only for a specific language.

If I'm asking you, I don't know, what's the capital of France, and I'm asking in English, you probably want to have the answer in English, so you can filter to English in the first place.

So those are the strategies that I've used, basically, for large-scale retrieval.
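For reference, here is roughly what an agent-generated filter plus a partition-scoped search could look like with pymilvus. The collection name, the field names, and the `embed` helper are made up for the example.

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# Filter expression produced by the agent from the user query via few-shot
# prompting, e.g. for "Uber data in 2021":
expr = 'company == "Uber" and year == 2021'

results = client.search(
    collection_name="filings",             # hypothetical collection
    data=[embed("What was Uber's revenue in 2021?")],
    filter=expr,                           # agent-generated metadata filter
    partition_names=["en"],                # e.g. only the English partition
    limit=5,
    output_fields=["text", "source"],
)
```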

Nicolay Gerold: Yep.

Have you played around with any chunking or metadata strategy which actually made it easier for the agent to retrieve the relevant documents?

Stephen Batifol: Yeah.

So at the beginning I was doing smaller chunks, I would say, also because the LLMs were not as capable as they are now.

Now I usually just throw in the whole page. I'm like: my page is one chunk. And I found it to be way better, if I'm talking about PDFs, for example.

I also found it better to integrate some metadata chunks as well, which are usually very useful.

Or, if my document is a webpage, then I'm going to split it more into paragraphs, and then I'm going to have a chunk of the paragraph, but also a summary chunk of the page, or a summary chunk of another section.

Usually the way I do it is to have the chunk itself and then a summary of another one somewhere as well, so you can filter through the first one and then go through the second one if needed.
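As a sketch, the paragraph-plus-summary layout could be produced like this; `call_llm` is again a hypothetical helper and the record schema is invented for the example.

```python
def build_chunks(page_url: str, paragraphs: list) -> list:
    page_text = "\n\n".join(paragraphs)
    summary = call_llm("Summarize this page in two or three sentences:\n"
                       + page_text)
    # One summary chunk for the whole page: cheap to filter through first.
    records = [{"text": summary, "kind": "summary", "source": page_url}]
    # One chunk per paragraph: the actual content, consulted if needed.
    for i, para in enumerate(paragraphs):
        records.append({"text": para, "kind": "paragraph",
                        "source": page_url, "position": i})
    return records
```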

Nicolay Gerold: Yeah.

Are you running multiple retrievals and then using those? Because you have different representations, you can run retrieval on the summary and on the source document, but also on the metadata.

Stephen Batifol: Yeah.

So, sales plug here: we support multiple vector searches at the same time, up to 10.

So that's basically what I'm doing, and then I'm using a re-ranker after.

That's usually how I do it: okay, I'm going to do a search, up to 10 as I said, and then I'm using my good old re-ranker.
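A rough sketch of what multi-vector search plus fusion looks like with pymilvus (2.4-style API): the field names and the `embed`/`extract_keywords` helpers are hypothetical, and a dedicated cross-encoder re-ranker could still run on the fused top-k afterwards.

```python
from pymilvus import connections, Collection, AnnSearchRequest, RRFRanker

connections.connect(uri="http://localhost:19530")
col = Collection("docs")  # hypothetical fields: text_vec, summary_vec, meta_vec

query = "What was Uber's revenue in 2021?"
requests = [
    AnnSearchRequest(data=[vec], anns_field=field,
                     param={"metric_type": "IP"}, limit=20)
    for field, vec in [
        ("text_vec",    embed(query)),
        ("summary_vec", embed(query)),
        ("meta_vec",    embed(extract_keywords(query))),  # hypothetical helper
    ]
]

# Fuse the per-field result lists (Milvus allows up to 10 of these).
hits = col.hybrid_search(requests, rerank=RRFRanker(), limit=10,
                         output_fields=["text"])
```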

Nicolay Gerold: Maybe an even more difficult question, moving more into SQL generation. How do you think structured information, stored in regular databases or DuckDB, will come into play for agents?

And if you've played around with that, what do you think is the best representation to give structured information to an agent?

Stephen Batifol: Good question.

When I play around with it, usually it's a lot of prompt engineering, I feel like. So it's a lot of: the agent being able to understand which one it should pick and when.

But so far it's really: okay, you have a query. And I'm just talking about what I tried: you have a query, and then you generate a SQL query for the data and get the result. And then if you have another query that is a bit more natural language, I would say, then you go into the unstructured side.

What I've played with is: you then have two results, and you ask the LLM to compare them, basically. You're like: okay, which one do you think is better? And if it thinks one is better, then you can just return that one. That's the way I've done it so far.

But I haven't found, and I don't know if you have an opinion on this one, a way that says: this is exactly the structured data, I have a result, this is the correct answer you should give. Usually I'm just asking the LLM.

Nicolay Gerold: So what I have had great success with, especially when the data is a little bit more complex and the data schema is very normalized: the LLM really struggles with the SQL generation because it has to do too many joins.

There I had success with basically trying to figure out what could be valuable and creating materialized views.

For the different representations. I keep them pretty large to include as much of the source data as possible, but then the LLM can basically just take one table and filter it down to what it needs.

Through that you have to do a little bit more work yourself, but you make it easy for the LLM to retrieve the different types of data.

Also, the number of query types you usually hit for taking the data out is limited. Often you have something like growth rate, you have absolute numbers, you have relative numbers, just broad strokes. And you have it from four different organizations or four different products and stuff like that, but there aren't that many different scenarios you would actually expose.

So you can create five or six materialized views and you're finished.
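A sketch of the idea in Python with DuckDB, since DuckDB came up; the schema (sales, organizations, products) is invented for illustration. Note DuckDB has no MATERIALIZED VIEW statement, so CREATE TABLE AS plays that role here; in Postgres you would use CREATE MATERIALIZED VIEW directly.

```python
import duckdb

con = duckdb.connect("analytics.db")

# One wide, denormalized table per question family, so the generated SQL
# never needs joins: absolute numbers and growth rate in a single flat table.
con.execute("""
    CREATE OR REPLACE TABLE revenue_by_org_product AS
    SELECT o.name AS organization,
           p.name AS product,
           s.year,
           SUM(s.amount)                                 AS revenue,
           SUM(s.amount) / LAG(SUM(s.amount)) OVER w - 1 AS growth_rate
    FROM sales s
    JOIN organizations o ON o.id = s.org_id
    JOIN products p      ON p.id = s.product_id
    GROUP BY o.name, p.name, s.year
    WINDOW w AS (PARTITION BY o.name, p.name ORDER BY s.year)
""")
# The LLM now only sees (and generates SQL against) this one flat table.
```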

Stephen Batifol: Yeah.

Yeah.

I need to have a look at those a bit more.

I feel like what I want to work on a bit more is actually the real production workload. So it's actually having structured data and unstructured data at the same time, 'cause I feel like that's what most companies have, actually.

Nicolay Gerold: Yeah. And the unstructured data at the moment is coming into the limelight really hard, because you suddenly can use it. So we start to ignore the structured data, which should be easier to get.

Stephen Batifol: Yeah.

We're like: man, everything unstructured, let's go.

Nicolay Gerold: And I want to know about the retrieval, especially for using the metadata: what are you using for that? Are you using something like structured extraction to basically pull the fields out of the user query, or what are you doing?

Stephen Batifol: Yeah, that's basically what I'm using at the moment.

I've basically given my little LLM some examples of queries. And then I'm like: okay, based on those, think of, I don't remember my exact prompt, but it's really: think of things that are very important, so entities or different things, or years, blah blah blah, that could be used as metadata.

And then it's a lot of playing around as well to actually create a filter that works, because for metadata it's really going to depend on so many things, so many variables. So I basically limit my metadata filtering to a couple of fields, because otherwise it just becomes too complex to actually generate it.

So then I have a bit of metadata, sorry, in my collection, and then it's really going to be like: okay, I have everything, creation time, last access, the source and so on.

And then I'm going to be like: okay, if I'm asking, I don't know, what is that in The Guardian, then the source is likely The Guardian.

And then we have filtering as well with prefix, midfix and suffix, so I can just play around with those too. It could be like: source contains Guardian.

Nicolay Gerold: Yeah, can you maybe give a few examples of the different things you can filter with prefix, suffix and midfix?

Stephen Batifol: Yeah.

I mean, with prefix it's really going to be: okay, you want something to start with a given string. Wait, sorry, I always confuse the two, but it's about where you put the percentage sign. If you have the percentage before the word, then you want something to finish with that string, and everything before it you don't really care about.

And then midfix is going to be in between.

So if you want to have 'house' in the middle, which doesn't really work as an example, but you could midfix on it and somehow find 'house' in the middle. I'm terrible at finding examples live.

And then suffix would be like, I don't know, for me all the suffixes were company names, because my documents were created that way. It was always year underscore company name, and that's it.

So then I would be like: okay, if you find some kind of company, then you put it at the end and you just check what comes before.

So that's the way I would do it.
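To pin down the percentage-sign semantics Stephen is reaching for: in SQL-style LIKE patterns, the percent sign marks the part you don't care about. Recent Milvus versions accept the same patterns in filter expressions; the field names and the `client`/`embed` helpers below are carried over from the earlier hypothetical example.

```python
# SQL-style LIKE semantics: % marks the "don't care" part of the string.
prefix_expr = 'source like "Guardian%"'      # starts with "Guardian"
suffix_expr = 'filename like "%_gmbh.pdf"'   # ends with "_gmbh.pdf"
midfix_expr = 'title like "%house%"'         # contains "house" anywhere

results = client.search(
    collection_name="docs",
    data=[embed("some query")],
    filter=suffix_expr,  # or combine: 'year == 2021 and source like "Guardian%"'
    limit=10,
)
```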

Nicolay Gerold: A good example for suffix, I think, is either file extensions, where did I get it from, or also legal forms of companies, which is something I have at the moment. Like in Germany, you
Stephen Batifol: Oh, GmbH and

Nicolay Gerold: Yeah, and stuff like that.

Which is a good example
of a suffix as well.

Stephen Batifol: Yeah.

I feel like, yeah, for those. But yeah, so that's the way I do it. I find it so interesting, so useful.

And also having the 'not in' filter as well. It'd be like: okay, I want data that is actually not in those metadata filters.

Nicolay Gerold: Yeah.

I'm really interested in the breaking mechanisms in the end.

Do you have

Stephen Batifol: The what?

Nicolay Gerold: The breaking mechanism, I call it. Do you have stuff built into the workflow where you actually just break the entire workflow, the generation, and say: hey, we can't return the results?

For example: the query is really ambiguous, I can't answer that, return it to the user. Or: the retrieval results are really bad, we can't do anything, maybe re-retrieve, or just break off the entire flow.

Stephen Batifol: Yeah.

So with ReAct, for example, when I was trying it, there was some kind of way where you can ask it to stop if you don't know what's happening. You're like: hey, let's take a break, we're going to stop here.

Sometimes what I like is that you have some question disambiguation. So really being like: try to infer what's happening from the context you have around you. Because sometimes it can help; you'll have a user who is like, I want information about it, or about that, and then you can be like: what is 'that'? So it's trying to infer what's happening around it.

If the query is ambiguous or doesn't make any sense, what I've found success with usually is to divide the query.

So that's the first step, if we have multiple questions.

And then sometimes it's also actually asking the LLM to ask the user: are you happy with the answer, or something? And based on that, you go a different route.

That's also a way I've done it in the past: okay, periodically check if the user is happy or whatever.

And then if yes, we can continue. And if no, then you have to ask again and then correct everything.

I like the idea of asking the user, but it's also a lot of work for the user; the user doesn't really want that very often. So that's usually one way.

And then the last thing I can think of is multi-stage retrieval. So instead of going all in, we do it in multiple stages.

Nicolay Gerold: Yeah, I think this is one of the most important things actually to figure out: for one, LLMs saying, I don't know,

Stephen Batifol: Yeah.

Nicolay Gerold: when they don't have any information on that.

And the second thing is when to actually ask: getting it to ask for more information or interact with the user, and when to actually break off that flow. Like, when does it have enough?

And this is so hard in text, because it's really hard to define what is enough.

Stephen Batifol: Yeah, it is.

And something I'm quite happy about: I built something, just a simple demo, the financials of Uber and Lyft, the usual, but then I built a whole thing around it.

And then I asked my LLM: okay, what is the financial data of Uber in 2023? And I only have data for 2021 and 2022.

And it was like: oh, actually I don't have this data at all, so I can't say. And I was like: yes! It didn't try to do something else.

So that's one part, but yeah, otherwise I feel like it's still a hard one.

Nicolay Gerold: Yeah.

So there is a free business idea for the people out there. I think information intake with customers, patients or whatever: so much depends on what you have already heard, on reacting based on the information the user has already given you, on asking follow-up questions, and on knowing when you don't want to go deeper because you won't get any more good information. If you can figure that out, it has so many different applications.

Stephen Batifol: Yeah.

Nicolay Gerold: The interesting part with agents: how do you evaluate the success, and also how do you evaluate the retrieval in the agentic workflow?

Stephen Batifol: I think I have two kinds of ways, usually.

So the first one is some kind of task completion rate. So: have you actually finished the task? You think you finished it at least; maybe it's not successful, but at least you finished it. Because there's also a way where agents will sometimes just go on and on, and they just never finish.

So that's one part.

Then also, obviously, asking for evaluation from the person using it. So, you know, the thumbs up and down. Obviously not everyone will do the thumbs up and down, but that can help.

And then also checking yourself, or checking with another LLM: for data where you know the answer, you have some kind of golden data set, and then every time you change your agent or something, you check: is it still answering my questions correctly, yes or no? I feel like that can be a good one.

And then there's Ragas. I played around with it very quickly, and I don't know if it's really fantastic for agents, but I know for RAG it was really good. It can monitor different things: faithfulness, or bad retrieval, bad response. Those were quite useful, I feel like.

Nicolay Gerold: Yeah, I think Ragas should be really suitable for agents, if you have a retrieval component as well, especially because they basically have metrics for all the retrieval stuff, but also for the answer which is coming out in the end.
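For reference, a minimal Ragas evaluation looks roughly like this (using the ragas 0.1-style API; the sample row is invented for illustration):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One row per (question, retrieved contexts, generated answer, reference).
data = Dataset.from_dict({
    "question":     ["What was Uber's revenue in 2021?"],
    "contexts":     [["Uber reported revenue of $17.46B for fiscal year 2021."]],
    "answer":       ["Uber's 2021 revenue was about $17.5 billion."],
    "ground_truth": ["$17.46 billion"],
})

scores = evaluate(data, metrics=[faithfulness, answer_relevancy,
                                 context_precision])
print(scores)  # per-metric averages over the dataset
```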

Stephen Batifol: Exactly. It's not only the usual metrics. So yeah, I feel like it's quite a nice project as well.

Nicolay Gerold: Yeah.

As soon as you move into workflows, I think, especially with agents and other LLMs, the evaluation part becomes a massive pain, because you have so many different steps and you have to evaluate each of those, and then the total system.

Stephen Batifol: Yeah.

I don't know what happened here: when the whole ChatGPT wave arrived and everyone moved into AI, everyone was building, everyone is still building, and no one was really checking for evaluation and stuff.

I think it's good that we're actually coming back to that a bit. Before, in the ML world, as we call it, the old one, you would always evaluate something before you put it into production. At least I hope so.

So yeah, it's good to have some kind of evaluation there as well.

Nicolay Gerold: Yeah.

What do you think about the multimodal stuff? I talked to Jo Bergum yesterday, about ColPali especially. How do you think this will impact your agentic RAG workflows?

Stephen Batifol: This one in particular I'm really looking forward to working with more. I feel like it's very promising.

Also, using VLMs in general seems to be very promising for documents that are a bit more complex. When you have some tables, you want to keep the layout of the document and everything; you don't want to lose all of that.

In my opinion, and I know some people in my company have very different opinions, it's really going to be a game changer, especially for agents, especially for overly complex documents.

And I think we're going to continue with the normal approach if you have, I don't know, an HTML page with no images and stuff, something very simple. But for the rest, it's nice to be able to keep the structure, also because parsers are very bad at that.

It's very hard. So don't get me wrong, I'm not blaming them, but parsing a PDF and such correctly can be very hard.

So yeah, I'm really hyped about multimodal in the future, and ColPali as well. It has a lot of potential.

Nicolay Gerold: Who is smarter: you in building a parsing pipeline, or the LLM in actually interpreting the unparsed information?

Stephen Batifol: Exactly.

And I feel like, yeah, I
think the LLM will win.

Nicolay Gerold: All the time, for sure.

Nice.

Perfect.

What do you think is missing in the space?

What would you love to see built, or what have you found actually missing in your stack when you're building the workflows?

Stephen Batifol: I think at the moment, I don't know, I was talking to someone yesterday who works at Intel, and we were talking about standards. It's something a bit stupid, but: I come from the MLOps world. When I arrived, there were no standards; when I left, it was clear what you were doing.

I feel like sometimes in my workflow, that's what I miss. I know it's not a very sexy one, but it can be a very interesting one.

Instead of, and I know it's very hard, instead of playing around with: okay, I need an embedding model, I need that, I need that. A lot of people don't know what they need to do. Having some kind of standards could be nice.

Or: okay, you have complex PDFs, then maybe your best bet is to go multimodal. I think that's a thing I'm looking forward to a bit. I love learning new things all the time, but sometimes you want to be like: okay, this one I know exactly how to do, I don't need to learn yet another thing because something popped off yesterday.

I think that's one thing I'm looking forward to a bit.

Nicolay Gerold: I'm not sure
whether you will get that.

I think I...

Stephen Batifol: Yeah, I'm not sure either, but it's more like, maybe I should rephrase: it's not exactly one path for everyone, because everyone is different. It's more about having a rough idea of: okay, this is what should happen, basically,

Nicolay Gerold: I think established best practices, not for what to use, but for how to approach a problem. And I think that is what is established in MLOps already, but what's missing in AI.

Stephen Batifol: Exactly.

Those are better words.

That's basically what I wanted to say.

Nicolay Gerold: Yeah.

And more on the technical side: what is the solution that, if it were built, you would adopt in a heartbeat?

Stephen Batifol: I feel like parsing is one. Honestly, parsing all those different documents can be a pain. I know there are parsing solutions, but I still feel like they're struggling with different things, because everyone is still learning about that; there's so much new research.

But if I could be sure the parsing didn't miss anything, I think that's one part where I'd be really willing to pay for it, for different use cases.

Then your data will be there, it won't be missed. And then it's more about: how do you retrieve it, and how do you search it? But that's a different part.

The parsing problem is, as we always said, garbage in, garbage out. It's the same. And it's the part that I used to not like when I was a data scientist: cleaning the data, making sure everything is cool. I still don't like to do it.

Nicolay Gerold: And now a hard question: overrated, underrated. What do you think in the LLM space or agent space is overrated? What is underrated?

Stephen Batifol: I think underrated is embedding models. They can make or break it, I think.

I have an example from back then: I was demoing something like chatting with the Berlin Parliament, so the data is in German, and very formal German at that.

And I had a demo which was using the embeddings from OpenAI, and it couldn't answer my questions at all. It would miss my questions. It would be like: no, this is not in the documents, I can't answer.

And then I used the embeddings from Jina AI, which are trained on English and German, or at least the one that I used was. And it actually found my answer, because my answer was written clearly in German, but OpenAI had completely missed it.

So I think that's a part people are slowly going back to: embedding models are actually quite important. You can't just pick any random one.

Overrated, for now, in my opinion, is long context windows. I would say they're overrated, like the two million tokens, apparently, of Google.

When Google showed it in their blog posts, it looked fantastic. But then when other people did some tests, it was still like, eh, it's still not there.

It misses some things that are in there. I think it's amazing when you do something one-shot: you have, I don't know, a 300-page PDF and you quickly have a question, you just want to go through it. That's cool.

But for a big company, I think, in my opinion, you might be missing something. And I don't know if you've seen it, but Nvidia released a paper yesterday or two days ago which is called OP-RAG. I forgot what the OP stands for.

Nicolay Gerold: How do you spell that?

Stephen Batifol: It's O-P, and then RAG.

But it was like: you basically keep the order of your chunks and everything. And they were showing that, with that, it was better. It was better than classic RAG, but it was also better than long context windows.

Nicolay Gerold: Yeah, I can't
find it, I'm only hitting Oprah.

Stephen Batifol: Yeah, no. I'll have to find it.

Everyone is always trying new things, and I'm always like... but yeah, I liked this one. So yeah, in my opinion, long context is a bit overrated.

Nicolay Gerold: Nice.

Perfect.

And if people want to start
building the stuff we just talked

about, where would you point them?

Stephen Batifol: If they
want to build it, you mean?

Nicolay Gerold: Yeah,

Stephen Batifol: I think the first thing is really: okay, build something you're interested in. I was interested in the Berlin Parliament, so that's what I did.

And then, I don't know if everyone would agree, but I would still think that using LlamaIndex or something could be useful if you want to start, because the documentation explains things quite clearly, I would say, like what you need to do.

And then, I don't know if you've played with Claude recently, but Claude can also be amazing at explaining things when you start. And I don't mean it in the way of: hey, write the entire code. It's more: if you ask, it'll be like, okay, can you explain to me what it's doing here? And then you still write the code yourself if you want; my opinion is that if you want to learn, you still have to go through it.

But yeah, I would say those can be very useful: asking LLMs, hey, I have this. And then it's going to tell you: okay, write this code. But then: hey, explain it, please.

Nicolay Gerold: And what's next? What's next for you? What's next for Zilliz and Milvus on the horizon that you can tease?

Stephen Batifol: I know we're going to have full-text search, so that's a cool part, going back to that.

But for me, it's going to be working more with multimodal, in general, like what we talked about before. I feel like it's only going to get better and better, so that's something I want to work more with.

And then also, just in general, having agents that are taking over things I don't like doing. If I could have an agent that could call someone, I'd be the happiest man on earth; I hate calling. So I'm actually trying to build something on the side with real-time voice that actually tries to answer some questions.

So yeah, maybe at one point I can showcase it, if it actually works. But yeah, otherwise: multimodal.

Nicolay Gerold: Nice.

And if people want to follow along with you or with Zilliz, where can they do that?

Stephen Batifol: Yes, they can find me on LinkedIn or on Twitter. So it's Stefan24 on LinkedIn and StefanBT on Twitter, or X, I should rather say. But a lot of what I do is published on LinkedIn.

Also, I'm based in Berlin; I organize meetups there, if you want to come. I organize unstructured data meetups monthly, which is everything about unstructured data.

Nicolay Gerold: So what can we take away when we're building search applications?

I think the first thing is: don't fall into the hype cycle. Agents are at the moment at peak hype in a lot of the LLM search area.

And 'agents' has in a lot of ways lost a little bit of its meaning, because it's used so freely for different things: for workflows, and for the multi-agent setups where the LLM actually plans itself, like the ReAct-based agent, where you basically have multiple steps the agent has to go through; it interacts with the environment, it observes the state, and then it makes a decision what to do next.

And that basically steps, over time, in the direction of reinforcement learning, or how reinforcement learning defines an agent, but the learning component is a little bit missing: optimizing a policy which optimizes an objective function isn't really there yet.

And I think this is the main thing missing for agentic RAG to actually work: you don't really have a good way of optimizing agents, because you don't have a clear way to do performance analysis or debugging.

Because if the agent is planning freely, and there are four different agents or sub-agents which are doing different tasks, have different tools, and interact with each other, how do you know where it actually went wrong, or where it started to go wrong? There can be a massive path dependence.

And that's why I actually don't like the free-planning agents. For me, where agentic RAG is really interesting is when we interject at certain portions of the control flow: we decide, based on, for example, the user query, which database we are querying and which type of search we are using.

This allows you, based on the context, based on user profiles, to take different paths through your query system.

Also, what Stephen mentioned: based on the answer you generated in the end, or, you could do the same based on the results you feed in, like the top 10 results, you can actually decide: okay, am I serving the information need of the user? Try to give the LLM as much context about the user as possible. And if not, you can basically try a different query, rephrase the query, do query expansion, whatever.

This will add to the latency of the query, but if you're not time constrained, if you're, for example, not in an e-commerce application, and the quality of the results matters more, I think this is something that could be very interesting.

Especially if it's very critical data, or results you're making decisions based on, where you actually should spend the time: hey, what are actually the relevant results, and what am I trying to answer here?

And I think asking the LLM 'am I serving an information need?' is probably pretty tricky, because it can be argued in many different ways that a document might be relevant. I would probably turn it around and do more of an error analysis: what's wrong with the different documents, given the query?

And then use that as feedback to maybe improve the query, do expansions, take different query paths, or expand or restrict the query in certain ways. Through that, you probably also get really valuable feedback data for your search system, which you can use to improve it over time.

And yeah, for agentic RAG, I've implemented it twice now, and you basically use it for certain points in decision making.

Also an interesting project on that is from Mixedbread: their BMX project, where they use LLMs for query expansion, basically. They generate alternative queries in the beginning, which for me is also a form that would fall into this agentic RAG definition.

But the most used approach is more the dynamic routing: hey, where am I going, which database system for this query, and how should I use it in the end?

And yeah, that's basically it. I would love to get your opinions on agentic RAG, and whether you've implemented something like that before and how successful you were. Did you actually see quantitative improvements in your search metrics, but also in the feedback you get from the users?

I think this is a little bit missing in the RAG space in general: it's not as metric driven as traditional search. We are getting there, especially with Ragas, for example; we are getting more and more metric driven. But it's also a little bit trickier, in my opinion, because we are not just delivering results, we are delivering answers, so we have to evaluate multiple components at the same time.

And yeah, let me know what you think. Also, if you like the episodes, leave a like if you're on YouTube, or leave a review on Spotify. It helps a lot.

And otherwise I will catch you next week. So talk to you soon.
