Nicolay Gerold: Retrieval augmented generation, or RAG, is often treated like search out of the box. You take your data, you slap some embeddings on top of it, and you just run user queries against it. But in reality, you need much more than that. You need systems, you need metrics, you need processes that actually allow you to get a quantitatively good RAG system. So how can we get to that? Today we are talking to Saahil Ognawala, the head of product at Jina AI, and we talk about the systems and processes necessary to build RAG systems and how synthetic data can help.
Welcome back to How AI is Built.
Let's jump right in.
Synthetic data is something we also tried.
For us, the task was much more
important than the persona itself
because we were trying to create this
instruction tuned embedding model.
Saahil Ognawala: So we dropped
the persona topic altogether.
But I can totally imagine that for language generation specifically, adding personas as another dimension could be quite useful.
Nicolay Gerold: What
dimensions did you guys use?
Saahil Ognawala: So, I'm not sure if you read this E5 paper. It was a very popular paper, right? The E5-Mistral instruction embeddings paper. Basically, the entire idea of fine-tuning with synthetically generated data, at least at Jina, started when this model was released and we got into understanding what makes this model good and so on.
Of course, the main proposition of this paper was that you can train a great embedding model by just using synthetic data as a proxy for a distillation method. So you would generate high quality synthetic data using a good quality LLM, which in that case was GPT-4, and then you would fine-tune a Mistral model on the synthetic data, basically using a decoder-only model instead of a traditional embedding model, right? And in this paper, they only use synthetic data. I think it was in the order of tens of thousands of examples, if I'm not wrong.
And what they introduced was this nice idea of conditioning based on task type. So the authors of the paper wrote a few templates of what types of tasks they were interested in solving with this embedding model. But beyond this category, they didn't tell it anything else. For example, if the type of task is classification, they didn't tell it that you might use the embedding model to classify documents of one kind versus another kind, or that you could use it to classify sentiments of one kind versus another, right?
So they would set a task template, but within these task templates, they brainstormed together with the LLM what topics or what specific tasks you can do with this downstream-trained embedding model. And then in the next step, you would condition the data generation on this task that the LLM itself generated. So it's a very nice way of constraining what the LLM produces, but within those constraints being as diverse and as creative as you want, which is really the trick to synthetic data generation.
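To make that two-step conditioning concrete, here is a minimal sketch of what such a pipeline could look like. The task templates, prompt wording, and the `call_llm` helper are hypothetical placeholders for illustration, not the exact prompts from the E5-Mistral paper or from Jina's pipeline.

```python
import json

TASK_TEMPLATES = {
    # Hypothetical task categories, loosely following the E5-style setup.
    "retrieval": "Brainstorm a list of 20 diverse text retrieval tasks.",
    "classification": "Brainstorm a list of 20 diverse text classification tasks.",
    "clustering": "Brainstorm a list of 20 diverse text clustering tasks.",
}

def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM client you use (OpenAI, vLLM, ...)."""
    raise NotImplementedError

def generate_synthetic_examples(task_type: str, n_examples: int = 5) -> list[dict]:
    # Step 1: let the LLM brainstorm concrete tasks within the task-type constraint.
    tasks = json.loads(call_llm(TASK_TEMPLATES[task_type] + " Return a JSON list of strings."))

    examples = []
    for task in tasks:
        # Step 2: condition the actual data generation on one brainstormed task.
        prompt = (
            f"You are generating training data for the task: {task}\n"
            f"Produce {n_examples} JSON objects with the fields "
            f'"query", "positive_document", "hard_negative_document".'
        )
        examples.extend(json.loads(call_llm(prompt)))
    return examples
```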
Nicolay Gerold: I think the
movement is really interesting.
We started out basically
with rule based systems.
We moved into training with data
and now we are basically moving
into generating the data we want.
So we can train the AI to do what we want.
Saahil Ognawala: Yeah.
And the leap has also been quite extraordinary. I remember when I was not yet working for Jina but was working for a company where we were using structured data for classification, generating synthetic data was already a topic back then, even when LLMs were not around. But it was always a given that one side of this argument would say: if you can have a system that can engineer synthetic data, it would be not trivial, but close to possible, to also reverse engineer that system and just learn the rules, right? So why wouldn't you do that? And just the leap from that point of thought to today, where you say that, of course you should use synthetic data, because none of the intrinsic models that you would build for your business would have anywhere close to the world knowledge that an LLM has. So it's futile to discuss whether you can reverse engineer it down to the rules that you're using to generate synthetic data.
Nicolay Gerold: Yeah, I think the general capabilities are necessary so that we can move beyond behavioral cloning, where we just have a fixed set of rules which are ingrained in the data and the model is actually learning them.
Saahil Ognawala: Yeah.
Correct.
That is still to say that we don't know what rules are encoded in the LLMs. It's just that we know the volume of these rules is so high that it's very likely that your synthetic data rule, which is somehow implicit in the business, is also encoded in there. So it makes sense to use it. Yeah, I agree completely that this leap is something that you simply cannot ignore right now. All you can do is closely control it, in my opinion.
Nicolay Gerold: Yeah.
What do you think can we take away
or learn from the previous eras the
more traditional rule based systems
and rule based retrieval systems?
Saahil Ognawala: The main thing, in my opinion, is still that you need to have control over what your end goal with these retrieval systems is, right? Ultimately, everything that we talk about revolves around understanding what the intent of the user is, right? If you knew what the user was intending to do with your search system, or intending to do with the results that they get out of this retrieval system, there would be no better proxy than saying: okay, I know what the user is going to do with these search results and what they're going to do with them afterwards, and that's why I should show them a certain set of results. After all, everything that we do is putting a proxy on this intent understanding, which is what we did 20 years ago.
20 years ago I was not working in the field, I was still a student, but I imagine that's what the seasoned professionals in retrieval already did back then. And I truly think this is still something that we're doing today. Understanding the intent of the user is something quite different from creating rules about how you should rank results or what kind of learning-to-rank algorithm you would use. And that goes back to what you said about persona creation earlier, right? One of the main reasons why persona creation with synthetic data is so powerful is that if you create enough personas, you also create enough implicit intents behind why someone would use it, right? And this certainly has not gone away from the traditional systems to now.
Nicolay Gerold: Do you think this part of actually trying to understand the user intent is a little bit lost at the moment, especially in the RAG space, because everyone is just basically trying to slap embeddings on it and use it as a cure-all?
Saahil Ognawala: I don't know if I could make a judgment one way or the other, whether we are focusing on it enough or not, because ultimately I don't think there is a way to understand the intent of the user directly, unless the user directly tells you something. What LLMs and RAG especially allow you to do is at least offer an opportunity to the user to lay this out very explicitly: I'm going to attend a wedding ceremony of my close friend, I need to stand out in the crowd, show me ties and matching shoes that make me stand out on this occasion while being seasonally appropriate and all that. All of this was simply not possible to do with the traditional token-based or keyword-based search systems, right? So at least we're trying to get close to that. I have to say that there is also the inherent risk that the users bias themselves too closely to what they think works for a RAG system, instead of maintaining fidelity to what they actually want. So that risk is obviously there. So yeah, that's the most I could say about it: RAG allows you to express this in a better way. But if you're only using a naive RAG technique to then translate it to something that your product catalog can understand, that is a part that we're all trying to get better at. I don't know, I would also be interested in understanding why you asked this question. Where have you seen this pattern where you say that understanding of the user's intent is something that we completely lost sight of?
Nicolay Gerold: I think, also in LLMs, the field always starts out very raw, trying to invent something completely new, and then over time it converges again to how we've done it previously. And this basically means being very metrics-driven, very methodical, setting up evaluation data sets, and trying to iteratively improve the system over time based on metrics. And I think at the moment in the LLM space we are already doing that loop back, we're starting to set up evaluation data sets. I think in RAG that's still a little bit lost.
Saahil Ognawala: I see.
I see.
Yeah.
I would agree with you there as well, especially for people who are setting up RAG for the first time or are doing this as a replacement for traditional search engines. I think there is a certain message that gets missed, perhaps: none of this language generation will help you move away from the fact that you need to set up your own retrieval benchmarks, specific to your business use case. You need to have an expert look at these retrieval benchmarks as well and continuously update them based on whether the needs of the business have changed, or whether the system that you have put in place showed behavior that you were actually not interested in deriving from it. And it isn't a magic wand that you can just place on your catalog and expect results that you didn't get before.
Nicolay Gerold: Yeah.
I think for me, search is not an AI practice, but rather an engineering practice. It's very methodical. You have to follow a practice. You have to follow what the data tells you in the end. And it's more about a frame of mind, how you approach it, rather than who does it.
Saahil Ognawala: Yeah, absolutely.
Correct.
And this frame of mind is not something that you can replace with AI, unfortunately. Not right now, anyway. This frame of mind is something that is cultivated in the mind of an expert who is experienced with retrieval, who knows what metrics to tune, as you said, right? This is not something that you can replace with an LLM entirely. That's why I also say the most interesting part, even in synthetic data generation, whether for evaluation or for fine-tuning embedding models, for example, is not whether you can generate data that you wish you had but simply don't have because real users don't interact with your system in that way. The really interesting part is actually constraining it within what your system is supposed to do and letting it be as creative within those limits as it can. That's really the needle in the haystack.
Nicolay Gerold: Yeah.
What would you say makes for a bad evaluation data set for RAG or retrieval in general?
Saahil Ognawala: Okay.
I want to start with what makes it an acceptable data set at all, and it will become clear why I start with this when I get to what makes it bad. Because one of the things that is typically missed in these retrieval benchmarks is a definition of what a bad result is, right? And I'll tell you what a typical retrieval evaluation benchmark would look like. You would have a question and you would have an answer. In the case of a traditional search system, this answer was a document ID or a product ID or a product title that you should be retrieving given a user's search string, right?
In RAG, you can think of this as a user's query in natural language and the answer as being an answer generated by an LLM, also in natural language, right? The right answer. What is equally important, in my opinion, is having some bad examples of what an LLM can give you. And this is what makes a retrieval data set bad: if you don't have the examples where you say explicitly that this would be a bad result if my RAG were to spit it out. Hallucination is of course one part of this, but if you can't tell in real-time usage whether an answer is a bad answer or a right answer, because the LLM just makes it look like the right answer, then you only have your retrieval dataset to blame for this. Hallucination is like a special case of this bad answer.
And an example of this could be: you could have a chatbot for insurance policies. If you have a chatbot that answers questions regarding insurance policies, the whole idea behind this chatbot, at least I imagine, would be that these policies are so complex that a layman who just wants to cover their assets over the course of a year is not interested in going down into the deep specifics of what each clause says, right? And you would then have this chatbot explaining to this person, who is hopefully your customer, the answer to a question like: okay, if I get sick abroad, which hospitals am I allowed to visit, which hospitals are covered by this travel health insurance policy?
What you would expect the RAG system to give you is an answer grounded in facts, of course. But of course, an LLM can generate an answer that just looks like it's grounded in facts, based on some clause of the policy, and the user would simply not know any better. And then it's actually the task of your retrieval benchmark to say that, okay, even though clause one and clause two both sound similar, and even though they're both answering the question about what the overseas travel policies are when you get sick, this answer is more correct than that answer, and that answer is actually not correct at all, because it doesn't specifically say what kind of hospitals you can visit, just as an example.
To close the circle on your question: one of the bad practices, in my opinion, one of the things that makes for a bad evaluation benchmark, is when you don't have specific examples of what constitutes a bad answer.
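As an illustration of that point, an evaluation entry could carry labeled bad answers next to the good one. The schema below is a hypothetical sketch building on the insurance chatbot example, not a format Saahil or any specific framework prescribes.

```python
# One benchmark entry for the travel-insurance chatbot example.
# Field names are illustrative, not a standard schema.
benchmark_entry = {
    "query": "If I get sick abroad, which hospitals are covered by my travel health insurance?",
    "good_answer": (
        "You are covered at any public hospital and at private hospitals "
        "listed in clause 4.2 of your policy, up to the stated limit."
    ),
    "bad_answers": [
        {
            # Sounds grounded, but omits the key specifics the user asked for.
            "text": "Your policy covers medical treatment abroad as described in the policy terms.",
            "reason": "vague, does not name which hospitals are covered",
        },
        {
            # Plausible but answers a different policy (legal protection, not travel health).
            "text": "You are entitled to legal assistance if a dispute arises while abroad.",
            "reason": "irrelevant to the question, wrong policy type",
        },
    ],
}
```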
Nicolay Gerold: Yeah.
Do you typically set up basically
end to end evaluation data sets and
then a data set only to evaluate
the retrieval system as well?
Saahil Ognawala: Typically you would do both of those things, obviously. But typically what our customers do is use some ready-made frameworks that allow them to set this evaluation framework up, and these ready-made setups, like Ragas, for example, or Quotient AI, have a lot of built-in metrics, even though all they ask from the user is the question, the answer, and the retrieved context.
And the idea is the following.
Basically, what you want to check
is whether the context that you're
giving to the LLM is the right
context given the question, right?
So if you have a user's question,
you want to retrieve the passages
from this insurance policy
document that I talked about, right?
You want to retrieve the right passages that are relevant for answering this question.
And these tools measure whether you are actually retrieving the right context. But then you also want to measure whether the generated answer from the LLM is following this context. So these are two key metrics that you should always track: the precision of the context, so the context precision, and the recall of the context, and this recall is actually reflected in the answer the LLM generates.
So usually you would actually benefit a lot from having a human-generated answer for a user query, because then it makes it easier for these automated frameworks to see how similar the answer that the LLM creates is to this human-generated answer. This is one. And then finally, there's also this idea of faithfulness in the answer generated by the LLM. Faithfulness is a very crude metric, in my opinion, but it works. It basically asks: of the answer that the LLM has generated, how much of it is relevant to the question and relevant to the context that you provide, and how much of it might be fact but there is no way to check it. So if you're asking about a policy document and the LLM is giving you answers about your house insurance, or a legal protection policy document instead of a travel insurance policy document, it may be right that you have some kind of legal protection when you are abroad, but it's simply not relevant to the user's question. So that's the concept of faithfulness, which is the proportion of the answer that is relevant to the context and the user's query, versus how much of the answer might be right but is simply not relevant and not checkable, you just can't check whether it's right or not.
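A rough sketch of what these metrics boil down to, if you were to compute them by hand rather than through Ragas or Quotient AI. The `judge` function stands in for an LLM-as-judge or human call, and the whole thing is illustrative rather than any framework's actual implementation.

```python
def judge(question: str, statement: str, context: str) -> bool:
    """Placeholder: ask an LLM (or a human) whether `statement` is supported/relevant."""
    raise NotImplementedError

def context_precision(question: str, retrieved_chunks: list[str]) -> float:
    # Fraction of retrieved chunks that are actually relevant to the question.
    relevant = [c for c in retrieved_chunks if judge(question, c, context="")]
    return len(relevant) / max(len(retrieved_chunks), 1)

def context_recall(question: str, reference_answer: str, retrieved_chunks: list[str]) -> float:
    # Fraction of the reference (ideally human-written) answer's statements
    # that can be attributed to the retrieved context.
    statements = [s.strip() for s in reference_answer.split(".") if s.strip()]
    supported = [s for s in statements if judge(question, s, " ".join(retrieved_chunks))]
    return len(supported) / max(len(statements), 1)

def faithfulness(question: str, generated_answer: str, retrieved_chunks: list[str]) -> float:
    # Proportion of the generated answer that is grounded in the retrieved context,
    # as opposed to statements that may be true but are unverifiable from it.
    statements = [s.strip() for s in generated_answer.split(".") if s.strip()]
    grounded = [s for s in statements if judge(question, s, " ".join(retrieved_chunks))]
    return len(grounded) / max(len(statements), 1)
```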
Nicolay Gerold: And to put your consultant hat on even further: if you come into a new company which has a RAG system but no evaluations, what are the steps you would typically go through to build out the evaluation data set for the RAG system?
Saahil Ognawala: Yeah, the first thing I would do is find out if this company has any experience with building search and retrieval systems at all. Because if it has any experience in doing so, then it would have become clear to them very soon, even before LLMs came into the picture, that you need to set up some kind of evaluation framework for this retrieval system. And these evaluation benchmarks for simple retrieval systems are honestly worth their weight in gold. They don't weigh anything, but they're golden.
So that would be the first place to start: how would you turn these existing benchmarks into something that is usable for a RAG system? Let's say, how would you convert the nature of the queries that you're seeing on these retrieval systems into natural language queries, if that's what you're expecting your RAG system to answer, and the same thing for the retrieved answers as well. So that's number one: I would take this existing trove of information that has already been used before to create some kind of retrieval system.
The next thing I would do is to set a baseline for evaluation. You need to have a way that is easy to implement and has well-grounded, well-researched reasons for why the results are computed in a certain way, and typically that would just be BM25. So I would set a baseline with BM25 retrieval, which tells you, given a question, whether you're retrieving the right answers using just BM25. And that is your minimum, right? If you cannot do better than BM25 by introducing semantic concepts to your retrieval system, then why would you even do it?
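A minimal version of such a BM25 baseline, sketched here with the rank_bm25 package; the library choice, naive tokenization, and toy corpus are assumptions, not anything prescribed in the conversation.

```python
from rank_bm25 import BM25Okapi

# corpus: list of documents/passages; golden_set: list of (query, relevant_doc_index) pairs.
corpus = ["...policy clause 4.2 covers hospitals abroad...", "...legal protection applies when..."]
golden_set = [("which hospitals are covered abroad", 0)]

tokenized_corpus = [doc.lower().split() for doc in corpus]  # naive whitespace tokenization
bm25 = BM25Okapi(tokenized_corpus)

hits_at_1 = 0
for query, relevant_id in golden_set:
    scores = bm25.get_scores(query.lower().split())
    top_doc = max(range(len(corpus)), key=lambda i: scores[i])
    hits_at_1 += int(top_doc == relevant_id)

print(f"BM25 baseline hit@1: {hits_at_1 / len(golden_set):.2f}")
# Any semantic retriever you add later should beat this number on the same golden set.
```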
The next step would be to understand the business context around it. Typically, a company that is looking to move into RAG from having only a traditional retrieval system in place would have, on top of BM25, some kind of business-specific re-ranking logic, right? And I would try to encode this business-specific re-ranking logic into either an LLM or a business-specific, maybe even fine-tuned, re-ranker. Because these re-ranking logics, again, are something that traditional BM25 engines would not be able to encode, since they're only based on keywords, synonyms and symbols in your search, basically syntactic. So this is something that I would do as well.
And in terms of setting up a RAG system end to end that includes semantic context, that's when it comes to evaluating different embedding models for retrieval. Because without close control over whether the context that you're retrieving to give to the LLM is correct or not, I don't think it's worth creating an entire end-to-end pipeline at all. And I especially don't think it's worth evaluating which LLM is best for your use case if you can't get a handle on whether you're retrieving the right context or not. Once this is done, only then would one go into evaluating which LLM would be the best fit for the use case. What would be the criteria for determining whether you should, for example, fine-tune the embedding model? You simply shouldn't fine-tune the LLM; that is something I would just put out at the onset: usually, if you're setting it up for the first time, you simply shouldn't fine-tune the LLM. Fine-tuning an embedding model is a good question, and figuring out the choke points at which an embedding model doesn't perform well for your business use case, and then setting those up again as an additional baseline for seeing whether fine-tuning would make sense, would be the final step for me.
Nicolay Gerold: Yeah.
And especially, I think before you move into fine-tuning, you should have a good evaluation data set and also an automated setup that runs your new retrieval pipeline with all its new components end to end and evaluates the different components and the end-to-end performance.
Saahil Ognawala: Absolutely.
Absolutely.
Agreed.
And internally at Jina we have this unsaid rule, which is not in any policy document that we have, not in any of the usage policy documents, but whenever we get requests from our enterprise users asking whether we can fine-tune a model on their business use case, our one rule is that they have to have used up at least one or two billion tokens, either on our API for embeddings or on their own self-hosted system, for it to make sense for us. Because it's not about the tokens that they've used, but really about understanding whether they have used a general-purpose embedding model enough to even come to the conclusion that maybe we should try out fine-tuning as an antidote to not having enough precision in the context.
Nicolay Gerold: Yeah.
And what I'm really interested in, especially in the datasets for fine-tuning embedding models as well, is the concept of negative and positive examples for retrieval. These are very interesting, especially for fine-tuning, because you can use more interesting losses, especially contrastive loss, which boosts the performance. Can you maybe explain what negative and positive examples are and also how they are created?
Saahil Ognawala: So that's an important question for embedding models, because the common knowledge in this field right now is that you want your loss function to decrease the distance between the user's question and its right answer, and increase the distance between it and the wrong answer. And what makes it even more interesting is this notion of hard negatives. The idea of hard negatives is that they sound like right answers but are actually wrong answers, because they simply convey the wrong information as an answer to the question, even though they might have a lot of the same words or the same structure. For example, if your user is asking for a gluten-free chocolate cake recipe, and one of the answers is a flourless chocolate tart, this is close to the right answer, this is what we call a positive. But another recipe, called sugarless chocolate cake or sugar-free chocolate cake, might sound like the right answer because it has a lot of the same words as the question, but it's not the right answer. So you shouldn't give it as an answer.
And what you want to do during your data preparation time is use something like a cross-encoder model to get a lot of the answers from your question-answer dataset that are similar in structure and tokens, and you mark most of them as hard negatives for a question, except for the one answer which you know is the right answer. This would be one way of doing this hard negative mining for fine-tuning or post-training your embedding model, so that your model learns to put the hard negatives, these right-sounding wrong answers, far away from the question.
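A minimal sketch of that mining step using sentence-transformers, assuming a generic MS MARCO cross-encoder and an in-memory QA example; the threshold and model name are illustrative choices, not the pipeline Saahil describes at Jina.

```python
from sentence_transformers import CrossEncoder

# Hypothetical QA data: each query has one known right answer plus a pool of other answers.
query = "gluten free chocolate cake recipe"
positive = "Flourless chocolate cake made with almond flour, naturally gluten free."
candidate_pool = [
    "Sugar free chocolate cake recipe using regular wheat flour.",
    "Classic chocolate chip cookie recipe.",
    "Gluten free brownie recipe with rice flour.",
]

# A cross-encoder scores (query, candidate) pairs jointly, which is good at
# surfacing candidates that merely *sound* like the right answer.
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = ce.predict([(query, cand) for cand in candidate_pool])

# Candidates that score high but are NOT the labeled positive become hard negatives.
HARD_NEGATIVE_THRESHOLD = 0.5  # illustrative cutoff
hard_negatives = [
    cand for cand, score in zip(candidate_pool, scores)
    if score > HARD_NEGATIVE_THRESHOLD and cand != positive
]
triplets = [(query, positive, neg) for neg in hard_negatives]
```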
Nicolay Gerold: Yeah, I think for hard negatives, the best example that always pops into my mind is the Paris versus Paris Hilton example. When you do a regular term search, then depending on the application, if you have a music application or a fashion one, you rather want the Paris Hilton one to pop up; if you have a more geographical search application, it's probably the city.
Saahil Ognawala: Yeah, correct.
Even though both Paris and Paris Hilton have 50 percent similar tokens, you want to make sense of these terms in the context of the question that you're handling. And that's what setting up the right contrastive loss with positive and negative examples gives you. That's the sense of it.
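For readers who want to see how such triplets typically feed into a contrastive objective, here is a short sentence-transformers sketch. This is a generic recipe under assumed model and hyperparameter choices, not Jina's actual training setup.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Triplets of (query, positive, hard_negative), e.g. produced by the mining step above.
train_examples = [
    InputExample(texts=[
        "gluten free chocolate cake recipe",
        "Flourless chocolate cake made with almond flour, naturally gluten free.",
        "Sugar free chocolate cake recipe using regular wheat flour.",
    ]),
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative base model
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# MultipleNegativesRankingLoss pulls query and positive together and pushes the
# provided hard negative (plus in-batch negatives) away, which is the contrastive
# behavior described above.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```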
Nicolay Gerold: Yeah.
And it's actually very hard to, for one, get people to say what a query should result in and which results are really the negative and hard negative examples. Unless you really spend some time building those datasets, you don't realize how hard it is, and how hard it is to actually codify it in some kind of way, in a document, or to just basically tell people what you are looking for.
Saahil Ognawala: Absolutely.
And if it wasn't this hard, you would expect that a purely semantic engine, one that takes, I don't know, cosine distances between the query and the answer, or between two sentences, would see enough uncorrelated information out there that it would simply learn this distance by itself. The fact that it is hard is reflected in the fact that the hard negatives and the queries actually do appear together in real-world data a lot, so you have to manually label them to separate them.
Nicolay Gerold: Yeah.
And for the model to know on its own what your users' intent is, that's impossible. So you have to set it. How does synthetic data actually come into play here? Can it be used to basically generate all of the data set: positive, negative, and hard negative examples?
Saahil Ognawala: So in our experience, yes. Actually, we have an automated pipeline to do this as well. Again, I refer back to the E5 paper from Microsoft, where they had this interesting idea of conditioning LLM outputs on diverse topics generated within some constraints. And what we found is that you can extend this idea of conditioning LLM outputs by also including different data types and different data formats that you want generated. So essentially, in the so-called task template that the Microsoft authors use, they use different task templates like classification or clustering or retrieval. What you want to add to that is also something like misguided examples, or negative examples, for a query.
And using this, you can actually force the synthetic data generator, essentially any LLM worth its salt, to generate hard-to-distinguish examples for the same question. So you can ask it, again, a question about the geography of France and have two different answers to it. And in fact, for the hard negative, it's much better to condition the hard negative on the positive answer in this case, instead of only conditioning on the query, because that makes sure that you have some examples that are counterintuitive to what the correct answer should be. And that's what you should label essentially as a hard negative. So synthetic data generation in this sense can be very useful to get a large volume of hard negatives, because typically in fine-tuning stages, finding the right balance between how much hard negative data you need, how much triplet generation you need, compared to how much pair data you have for contrastive learning, is actually the tricky part. And as with everything else, my suggestion is that synthetic data should be used not as a replacement for expert decisions, but as a factor that can inflate the volume of hard negatives that you have. I would still recommend having complete control over what the hard negative topics should be, for example where it should sample these topics from: a dictionary, a multilingual dictionary, if you may. But when it comes to conditioning them on a few examples from the past, the best results appear when you use few-shot learning rather than zero-shot learning. So you give the LLM a few examples of what a question, a good answer and a bad answer to this question look like, then condition them on a diverse set of topics from the dictionary, and let it generate hard negatives based on the query, the diverse topic, as well as the true negative, or rather the true positive, so to say.
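To make the few-shot, positive-conditioned generation concrete, here is a hypothetical prompt-construction sketch. The prompt wording, the example topics, and the `call_llm` helper are all assumptions for illustration, not the actual prompts used at Jina.

```python
import random

FEW_SHOT_EXAMPLES = [
    {
        "question": "Which scientific classification does the blue whale have?",
        "good_answer": "The blue whale is a marine mammal of the family Balaenopteridae.",
        "hard_negative": "Echo sounders are widely used by marine engineers to measure ocean depth.",
    },
]

def build_hard_negative_prompt(question: str, positive: str, topics: list[str]) -> str:
    topic = random.choice(topics)  # diversify across a constrained topic dictionary
    shots = "\n\n".join(
        f"Question: {ex['question']}\nCorrect answer: {ex['good_answer']}\n"
        f"Hard negative: {ex['hard_negative']}"
        for ex in FEW_SHOT_EXAMPLES
    )
    return (
        f"{shots}\n\n"
        f"Topic: {topic}\n"
        f"Question: {question}\n"
        f"Correct answer: {positive}\n"
        # Conditioning on the positive keeps the negative plausible but contradictory or irrelevant.
        f"Hard negative (sounds like it answers the question, shares vocabulary with the "
        f"correct answer, but does not actually answer it):"
    )

def call_llm(prompt: str) -> str:
    """Placeholder for your LLM client."""
    raise NotImplementedError
```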
Nicolay Gerold: Yeah, I think the concept of synthetically generating hard negatives is a little bit hard to grasp, because you're trying to find something that's semantically very similar but conceptually different. Do you maybe have an example from a past project you've done which makes it a little bit more concrete?
Saahil Ognawala: We had this project. When we were trying to publish a paper at ICLR adding to this E5 paper from Microsoft, we tried to generate a few hard negative examples to show whether the fine-tuning makes sense or not. And one of the examples that we had was something to do with ocean life, marine life, right? The idea was that the topic we were interested in generating was related to marine biology. The questions were all about what scientific classification a certain kind of whale has, and then you have some answers to it. And for some of the answers, if you didn't condition it exactly on the topic of marine biology, the LLM would go into generating answers related to marine engineering, saying things like: okay, this instrument is widely used by marine engineers to measure depths in oceans, and so on. It sounds like it has everything to do with marine biology, but it has nothing to do with the question that I asked about whales. And without conditioning it on the true positive, the right answer, it would go on blathering about all of these unrelated marine engineering topics, which are obviously not relevant; as a human I can tell that they're semantically not relevant to the question, but at least from a mathematical point of view you would get a close enough cosine similarity, or some kind of similarity, with them. But when you start conditioning it on the true positive, it starts generating answers about fishes and whales, which it wouldn't do if you didn't condition it. You can imagine that marine biology is closely related to biology and a lot of research fields, so there would be a lot of terminology that would co-occur, but not exactly the terms that were related to the question that the user was asking. I don't know if this is an interesting answer for your listeners, and I'm happy to dig up more from the paper draft as well, but it's just something that stuck out to me, at least as one of the error cases that we were trying to deal with.
Nicolay Gerold: No, I like it. I like the examples from practice. I think stuff like marine biology, the areas that are boring for tech people, often gives really interesting use cases. How do you actually go in the reverse direction? How do you actually ensure that the queries you use or synthetically generate match up really well with the user queries?
Saahil Ognawala: And that's something where I think a lot more human intervention is helpful than only relying on synthetic data. I already mentioned the idea of few-shot prompting rather than zero-shot prompting. So I do think that it's highly valuable if you have a predefined list of questions that a human has written, or something that you have gathered from a cold-start use case: for example, you set something up in production without any evaluation in place, you collect the kind of questions that the users are interested in having answered, you maybe map them to answers that your experts would label, and then use them as few-shot prompting examples for an LLM. This, in my opinion, is the best way to make sure that the LLM doesn't diverge too far from your idea of a good data set. You, of course, also want to make sure that the LLM never escapes the conditioning logic that you give it; it should always be constrained within those conditions. Diversifying based on a dictionary of topics is a great way to make sure that LLMs only generate questions about the topics that you're interested in. I know that essentially this would mean that you only have a few questions, given a generic embedding model, for example, right? But it really does work. If you pick a multilingual dictionary, or just a normal dictionary with a lot of words, and you remove all the adjectives, the pronouns and the prepositions from it, you do have a diverse list of topics, so to say, that an LLM could pick from, while still being constrained within the limits of what it should generate, and while still following the same logic that you give it in the few-shot prompting. So basically, the answer is: few-shot prompting is better than zero-shot prompting in this case.
Nicolay Gerold: Yeah, we already ran through a bunch of inputs, like for example user personas, few-shot examples, an example set of documents, and also the topic dictionary. What other inputs have you used for synthetic data generation?
Saahil Ognawala: Correct, yeah, so the topic dictionary is one of them that we've used. There is of course this idea that you can incentivize an LLM with the kind of incentives that you would give an intern, like "I'll cut your salary in half" or something like this. I honestly don't have any empirical results on whether this works or not, so I wouldn't be able to recommend it one way or the other, to be very honest with you. This could be one. And the other kind of input that you can give it is, of course, I don't know if you also discussed it, but the task would be another one. A lot of embedding models nowadays are instruction-driven, so you do want to give it the instruction that the examples it generates based on this topic are going to be used to train a model for clustering, for instance: make sure that you generate multiple examples that are clustered around this central topic while being far enough from another topic. So you do want to give it information about what the downstream task is going to be. It is also important, by the way, not to veer too wide and far, because there is this problem of being lost in the middle with a lot of LLMs, so you want the instructions to also be very precise. Try to make sure that the data you generate is in a structured format, and that the LLM actually sticks to that format. Yeah, and that would be about the extent of what you can give as input to an LLM for synthetic data generation.
Nicolay Gerold: Yeah.
And what is like the balance
you actually strive for between
synthetically generated data and real
world data in your evaluation dataset?
Saahil Ognawala: Yeah.
That is a very interesting
question and one that I'm also
trying to answer for myself.
Again, I don't know if I have empirical results for this. Specifically, if you're talking about synthetic data for evaluation, I don't even know if there are many studies that show that you should have X percent of synthetic data versus Y percent of real-world data. There have been some studies on this when it comes to actually training models, and depending on the task type that the models are being trained for, this ratio might also differ. There was this DeepMind paper, this year or last year, talking a lot about how model performance improves or deteriorates with synthetic data. They showed that a lot of efficiency gains could be had from using synthetic data for tasks like improving an LLM's mathematical abilities, or improving the alignment of a multimodal LLM between images and text, right? I had a look at these papers this morning and I still couldn't find any hard numbers in them, but I did see a lot of online discussion, and this also checks out with our own experience of training embedding models internally: a good split is somewhere around half and half. The half proportion of real-world data makes sure that the quality of the model stays beyond what a non-fine-tuned model can achieve, while if the synthetic data goes far beyond half of the total volume of data that you're training with, then you stop getting the gains that I just mentioned from the Google DeepMind paper. It seems like the agreed consensus right now, unpublished in any academic source that I could find, is around half and half of synthetic data versus real.
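If you wanted to enforce that rough half-and-half cap when assembling a fine-tuning set, a trivial sketch could look like the following; the 50 percent figure is the speaker's rule of thumb, not a published result.

```python
import random

def mix_training_data(real_examples: list, synthetic_examples: list,
                      max_synthetic_ratio: float = 0.5) -> list:
    # Cap synthetic data so it never exceeds max_synthetic_ratio of the final mix.
    max_synthetic = int(len(real_examples) * max_synthetic_ratio / (1 - max_synthetic_ratio))
    sampled_synthetic = random.sample(synthetic_examples, min(max_synthetic, len(synthetic_examples)))
    mixed = real_examples + sampled_synthetic
    random.shuffle(mixed)
    return mixed
```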
Nicolay Gerold: Yeah.
And how do you actually go
about like really methodically
improving the data set?
So you have your RAG system basically in
production, how do you actually improve
your evals as your system evolves as well?
Saahil Ognawala: This is all about, once again, closing the circle on user intent. The best signal that you can get on these RAG systems, just like the best signal you could get on traditional search systems, is understanding from the user whether the answer was useful to them or not. And this is something that you have to continuously keep doing. Most of our enterprise users who have seen the most success with their RAG systems are the ones that implemented a continuous feedback mechanism very early. Having worked in an enterprise setting myself and having implemented such a system, I also have to admit that it is not so easy to get feedback from the users directly. Most users don't want to spend any time telling you whether an answer has been useful to them or not. So, in addition to adding an active kind of feedback loop, you also have to think about proxies that can tell you this. For example, oftentimes what we used to do is measure the time between two queries by the same user, right? Which would tell you whether, with the first query, the answers that the user got were useful to them or not, or would at least indicate whether that was the case. Then there are obvious things like click-through rates of the top three results, or the top ten results, or the top one result, right? And if those click-through rates for the top k results are beyond a certain threshold, that is telling you that in most cases the top results that you're showing the users are the ones that they are actually interested in. And again, none of what I've said, all three items, direct user feedback, proxy user feedback, or click-through rates, none of these are unique or novel in terms of what RAG has put forward; this is basically classical search system implementation.
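A small sketch of those two proxy signals computed from a query and click log; the log schema and the thresholds are made up for illustration.

```python
from collections import defaultdict

# Hypothetical event log: each entry is (user_id, timestamp_seconds, event_type, payload).
events = [
    ("u1", 100.0, "query", "travel insurance hospitals abroad"),
    ("u1", 103.0, "click", {"rank": 1}),
    ("u1", 400.0, "query", "claim reimbursement form"),
]

def time_between_queries(events):
    """Gaps between consecutive queries of the same user: short gaps hint the first answer failed."""
    last_query_time = {}
    gaps = []
    for user, ts, kind, _ in events:
        if kind != "query":
            continue
        if user in last_query_time:
            gaps.append(ts - last_query_time[user])
        last_query_time[user] = ts
    return gaps

def ctr_at_k(events, k=3):
    """Fraction of queries that got a click on one of the top-k results."""
    queries, clicked = 0, 0
    pending = defaultdict(bool)
    for user, _, kind, payload in events:
        if kind == "query":
            queries += 1
            pending[user] = True
        elif kind == "click" and pending[user] and payload.get("rank", k + 1) <= k:
            clicked += 1
            pending[user] = False
    return clicked / max(queries, 1)
```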
Nicolay Gerold: Yeah.
What's next?
So what's next for Jina?
What's next for you?
What's next for synthetic data?
What's on the horizon
that you're excited about?
Saahil Ognawala: I'm excited, basically, a lot more about our specialized models that are sub-500-million parameters, sub-300-million parameters, so base-size models that are specialized for certain tasks. For your listeners who don't know, Jina is in the business of creating embedding models, general-purpose embedding models, multilingual and bilingual so far. What we would like to continue doing is create embedding models that are easy to host on, let's say, a CPU, while being as performant as decoder-only embedding models of a billion parameters, right? Decoder-only embedding models of course have great performance, but who has the capacity to host them, even on a single GPU, right? We want to make these specialized models so good, in terms of the architecture and the kind of specialized fine-tuning data set they have seen, that you simply can't look past them, and the latency is so good. So this is something that we're super interested in.
I personally have also gotten more and more interested in the classification problem, and why, what is the appeal of boosted trees, for example, when it comes to classifying in the classical way, right? I have a hunch that the explainability of these traditional methods plays a big role in why people like to use them for classification and not LLMs. And this is something that I think is a super interesting topic for deep learning scientists to also look into: improving the explainability of their models, embedding models, classification models.
And I truly hope that if there is one message that your listeners, and basically everyone building AI, take away, it is that AI is not a magic wand that can do things that you were not able to do before. That's not the right way to look at AI. It's actually a way, first of all, just like synthetic data, to increase the volume of the right operations. And it's really a way to think about how you can increase the efficiency of the processes that you already know how to accomplish, where you replace the purely tedious and boring jobs that you have to keep repeating and instead use AI for them. That's where the power really is.
Nicolay Gerold: Yeah, and if people
want to start building the stuff we just
talked about, where would you point them?
Like, where can they get started?
Saahil Ognawala: They should, of course, check out our website, Jina.ai. Our blog is actually, in my opinion, a really good place to start for this as well. We put out articles that are not necessarily conditioned on any new developments that we're doing. Of course there are release posts and so on, but there are a lot of opinion posts on our blog as well, which I truly think are not biased towards only what we are building. They're quite opinionated, as I think they should be.
That would be a good place to get some food for thought. In terms of building, for example, synthetic data sets and evaluating your RAG systems, please also check out open source frameworks like Ragas. I'm really a big fan of these frameworks, because using an open source framework is the only way to understand what metrics these frameworks are measuring on your RAG systems, instead of them being completely opaque. If the judge of your LLM is purely another LLM, then you also want to look deeper into why you should be doing that at all, and why you shouldn't have human-annotated data instead. So I can highly recommend using Ragas for evaluation.
And yeah, if you're interested in fine-tuning your own embedding models, or seeing how synthetic data can improve models by fine-tuning them on your domain, we also have a fine-tuner tool on our website that basically uses synthetic data based on a domain and task description to fine-tune either our own models or BGE models.
Nicolay Gerold: Yeah, and if
people want to follow along with
you, where can they do that?
Saahil Ognawala: You can find me on X, my handle is sahil, S A H I L. I'm also on LinkedIn. I'm also on Mastodon, if you're on the Fediverse, and you can follow me on GitHub. I sometimes comment there on our open source repositories as well.
Nicolay Gerold: So what can we take away? I think especially the point that RAG is no magic wand that you can place on your product catalog or your dataset and expect results you didn't get before is one of the most important parts. Embeddings might bring really good performance out of the box, but the performance is often capped, and it's capped at a point where it isn't good enough for most search applications. Once you bring them to the users, they might seem like they're working well, but in the end they often truly aren't, especially because LLMs are really good at bullshitting and have a lot of internal knowledge.
RAG systems often have the issue that you don't really know what's going on under the hood if you don't look at what is actually returned by your retriever component. And often it's the case that even when you retrieve complete garbage, the model can make something useful out of it. You basically have to first figure out: okay, there are some issues with your queries, what are the problematic areas, why did it return these results, and how can you improve it? There you then basically follow the system or the process he laid out. You first create a golden dataset, which often consists of very common queries, but also of the other frequencies of queries.
If you caught the previous episodes, you know you basically have three categories of queries based on frequency: head, torso, and tail. Head queries are your most frequent queries, which you always want to have in your golden data set, and then you have torso and tail as well. Tail queries are really specific queries which occur very infrequently, but there is a massive amount of different queries in the tail. So you basically want to have a little bit from each category, or each frequency bucket, in your golden dataset, and then you want to use that to test your retrieval system.
Whenever you make a change, this will actually show you: you have your previous retrieval system, you take the golden dataset, you run the queries, then you evaluate it. And for the evaluation, you can use a lot of different methods and metrics. You can use precision, recall, MRR, you can use nDCG, and in most cases you want a combination of these. Which of these, again, depends a little bit on the application, especially whether it's more precision- or recall-heavy. Then, after a change to the retrieval system, you can run the evaluations again and actually see whether you were able to improve it.
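For reference, here are minimal implementations of the metrics mentioned, computed over ranked result lists against a golden set; a bare-bones sketch, not a recommendation over using an existing evaluation library.

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k=10):
    top = ranked_ids[:k]
    return sum(1 for doc in top if doc in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k=10):
    top = ranked_ids[:k]
    return sum(1 for doc in top if doc in relevant_ids) / max(len(relevant_ids), 1)

def mrr(ranked_ids, relevant_ids):
    # Reciprocal rank of the first relevant result.
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevance, k=10):
    # relevance: dict mapping doc id -> graded relevance (0 = irrelevant).
    dcg = sum(relevance.get(doc, 0) / math.log2(i + 2) for i, doc in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```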
And especially when you're adding completely new components. So for example, you have an existing search database, or you have a Postgres instance on which you are running your search, just full-text search, and you want to add a semantic search component. You can spin it up and quickly run the evaluations, and when you don't really see any significant dent, it already points you to the fact that it might not be worthwhile to even implement it.
Then you basically go more in the direction of systematically improving it over time. So you need to establish processes. You want to actually monitor user behavior and how they react to the results or the responses. In traditional search applications that's a little bit easier, because the results are given to the user and then you can see, based on the clicks and the dwell time, whether the user was likely to have found the result relevant.
Why am I talking so unspecifically, like "likely to have found it relevant"? Because you can never really tell. It might have been a clickbaity headline and the user just jumped into the video and then basically bounced right back, and this is also something you should probably track. You can never be certain whether it was actually relevant, or whether he might have clicked on the video and then just put his phone away but kept it running. So the dwell time is high, and also the amount of time he spent on the video, but in the end it doesn't really say anything, because he didn't really watch it. So in the end you have to look at those in aggregate. You have to use a combination of metrics, so you can be more certain whether it is a positive or a negative signal.
Once you have these monitoring systems in place, you can basically improve the retrieval system. But with RAG, since you're generating an answer, it's a little bit more complex, because you're likely not showing the user all the results but only the answer. So you have to figure out another way to propagate feedback back to the retrieval results. When you have an answer and the user found it useful, like he gives you a thumbs up, you can likely say that the retrieval worked well. But in your retrieval there might be three good results and three bad ones, and the model just made something good out of it. So it's way more complex to actually drill down and get good feedback from the user with answer generation. This is something that you probably have to think about a lot once you get into that place.
This is also something you will likely have to do through some form of LLM or semantic similarity: you basically look at the chunks and see which chunks actually contributed to the answer it gave, and which didn't, which are opposing it or completely irrelevant. And then you basically propagate the feedback back only to the chunks which were actually contributing to that specific answer.
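A rough sketch of that chunk-level attribution using embedding similarity between the generated answer and each retrieved chunk; the model choice and threshold are assumptions, and an LLM judge or NLI model would be a more precise alternative.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def attribute_feedback(answer: str, retrieved_chunks: list[str], thumbs_up: bool,
                       threshold: float = 0.5) -> dict[int, int]:
    """Return per-chunk feedback (+1 / -1 / 0) derived from answer-level feedback."""
    answer_emb = model.encode(answer, convert_to_tensor=True)
    chunk_embs = model.encode(retrieved_chunks, convert_to_tensor=True)
    sims = util.cos_sim(answer_emb, chunk_embs)[0]

    feedback = {}
    for i, sim in enumerate(sims.tolist()):
        if sim >= threshold:
            # Chunk plausibly contributed to the answer: inherit the user's signal.
            feedback[i] = 1 if thumbs_up else -1
        else:
            # Chunk likely did not shape the answer: don't credit or blame it.
            feedback[i] = 0
    return feedback
```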
And yeah, I think this shows a lot of the complexity.
And we also always come back to the two different types of search: BM25 and semantic. In most search applications you're likely to use both. But, and I think this is also something he mentioned, if you get good results with BM25 and semantic search doesn't give you significantly better results, you should stick with BM25, because it's just easier to handle, it's easier to manage, and it's also way more computationally efficient. I think I've been rambling for a long time.
One more point I want to mention is his view on synthetic data: that it's more important to concentrate on the generation, on the system that's actually generating the data. I think this was really interesting, that you should always try to restrict it to the bounds of the system. And the system is probably evolving all the time as well, which means that your synthetic data generation should evolve as well.
But yeah, this was the episode with Saahil. If you liked it, let me know, subscribe, and if you're on Apple or Spotify, leave a review. It helps out a lot. Otherwise, we will be diving deeper into synthetic data next week, so stay tuned for that. Otherwise, have a good one. Talk to you soon.