
From Ambiguous to AI-Ready: Improving Documentation Quality for RAG Systems | S2 E15


Nicolay Gerold: LLMs hallucinate.

And often we want to put the blame on them, but on occasion, or even often, it's also our own fault. Many knowledge bases we use for retrieval have temporal inconsistencies.

So they might have multiple
versions of the same document

from different time periods.

They might also have historical
information without timeline context.

So it is unclear when this information was actually relevant. They might also have reference problems.

So, for example, implicit references like "it" or "this system" without explaining what they refer to, but also undefined aliases and terminology.

Especially in large organizations, terminology evolves over time, and an organization can build an entire terminology just for internal communication.

And then you can also have structural issues in the document itself, like missing semantic structure.

And that is on top of plain
mistakes and human error.

Especially with RAG, these issues are more pronounced, because you aren't looking at a full document anymore, where you can be sure there is a certain contextual relevance. You're only looking at bits and pieces, and these bits and pieces, or parts of documents, chunks as they are called, are often created through naive chunking. So you're just chunking based on length, for example 400 tokens.

And this means we have to be extra careful when we are doing retrieval-augmented generation and are feeding facts into our LLM.

And today on How AI Is Built
we are talking to Max Buckley.

Max works at Google and has done a lot of interesting experiments with LLMs, especially on using them to improve knowledge bases and documentation for retrieval-augmented generation.

And we'll be talking about identifying ambiguities, fixing errors, and also creating improvement loops for your document base.

So welcome back to our series on search.

Let's do it.

Max Buckley: What Anthropic are
proposing is that you have a set of

documents that you want to chunk.

And then for each of these documents,
you break them into some arbitrary

set of chunks, let's say K chunks.

And then what you do is you get Claude in their case, but any LLM, to look at the entire document and then at the chunk, with the instruction to contextualize this chunk given the document.

So one of the examples is that you might have a chunk that says something like "the revenue was 43 million", and that chunk on its own doesn't really tell you anything. If someone searches for something about revenue, they will find it, but the revenue for what was 43 million? But of course, given the document, the contextualized version is something like: given the SEC filing for this company, the revenue is 43 million, which obviously has much more meaning in it.

And they use this contextualized version,
both for embedding and semantic search,

but also for keyword and lexical search. If the chunk now has something like the SEC filing or the company name in it, then you could potentially find it, and not just via the word revenue.

Obviously there's different
ways of thinking about chunking

and contextualized chunking.

One thing we were thinking about
was like, if documents contain

intrinsic structure, you can obviously
maybe make use of that, right?

If you have a hierarchy of headings: obviously you get some information from the file name and file path, but then the hierarchy of headings, the H1, H2, H3, potentially also contains some information. You can maybe flatten or ignore the intermediate paragraphs, but knowing the hierarchy of headings also gives you some more information that a sentence or paragraph may lack, right?

And so including that can help. But the Anthropic approach particularly resonated with me when I saw it because, as I said, it uses a large context model: to be able to process a whole document plus the chunk, you need tens of thousands of tokens, sometimes hundreds of thousands, to span the document. But it also uses context caching, right? So effectively, you pay to cache the document, and then each incremental call is just a small paragraph. They cite some number around $1.02 per million tokens processed, which is obviously very nice.

And most of the major providers now
offer this context caching feature.
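
A minimal sketch of this contextualization step, assuming the Anthropic Python SDK; the model name, prompt wording, and cache setup are assumptions loosely following the published description, not the exact implementation:

```python
# For each chunk, ask the model to situate it within the full document.
# The document goes into a cached system block so the repeated per-chunk
# calls only pay full price for the small chunk-sized suffix.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize_chunk(document: str, chunk: str) -> str:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumption: any large-context model works
        max_tokens=200,
        system=[
            {
                "type": "text",
                "text": f"<document>\n{document}\n</document>",
                "cache_control": {"type": "ephemeral"},  # cache the large document
            }
        ],
        messages=[
            {
                "role": "user",
                "content": (
                    f"Here is a chunk from the document above:\n<chunk>\n{chunk}\n</chunk>\n"
                    "Write a short context that situates this chunk within the document, "
                    "to improve search retrieval. Answer with the context only."
                ),
            }
        ],
    )
    context = response.content[0].text.strip()
    # Index this for both embedding-based and lexical (BM25) search.
    return f"{context}\n{chunk}"
```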

Nicolay Gerold: And are they doing it fill-in-the-middle style, basically how we see it in masked training, where you have the stuff before, an empty slot, and the stuff after it, and then you fill in the middle chunk? Or is it only the preceding document?

Max Buckley: No, it's the entire document up top. So the section that you're contextualizing is effectively duplicated. It's exactly the whole document.

Nicolay Gerold: Yeah.

Max Buckley: but you're right.

You could obviously do it the other way.

You could, for example, only include the prefix that you've seen before.

Yeah, a lot of these things,
they're more art than science.

You can try things and see what works.

Another interesting area here, as well as potentially contextualizing the chunk, is to actually have multiple embeddings for the same chunk. So if you have a chunk, you have the raw chunk embedded, potentially a summary of the chunk embedded, and hypothetical questions that the chunk might answer, if your system is question-and-answer based. And what you will probably see is that for a very long chunk, the distance from your user query could be large, whereas for the summary it might be a little bit smaller, and then for the question it might be significantly smaller, because they're both in that same space, right? That kind of question space.
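
A rough sketch of the multiple-embeddings idea using sentence-transformers; the model name is an assumption, and generate_summary / generate_questions stand in for LLM calls you would supply:

```python
# Embed several views of the same chunk (raw text, summary, hypothetical
# questions) and let any of them match the query; all views point back at
# the same chunk_id, and we keep only the best-scoring view per chunk.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def index_chunk(chunk_id: str, chunk: str, generate_summary, generate_questions):
    views = {"raw": chunk, "summary": generate_summary(chunk)}
    for i, question in enumerate(generate_questions(chunk)):
        views[f"question_{i}"] = question
    return [
        {"chunk_id": chunk_id, "view": name, "embedding": model.encode(text)}
        for name, text in views.items()
    ]

def search(query: str, index: list, top_k: int = 5):
    q = model.encode(query)
    ranked = sorted(index, key=lambda e: -util.cos_sim(q, e["embedding"]).item())
    seen, hits = set(), []
    for entry in ranked:
        if entry["chunk_id"] in seen:
            continue                      # best-scoring view per chunk wins
        seen.add(entry["chunk_id"])
        hits.append(entry)
        if len(hits) == top_k:
            break
    return hits
```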

Nicolay Gerold: Yeah, and I think they're fixing different problems. The one where you embed the questions is rather that the question and the document live in different spaces and are formulated in a different way, use different lingo, use different terms.

And the other one is more about
losing relationships and losing

information that's important
in the context of the document.

Max Buckley: Yeah, exactly.

Exactly that, right?

That the query content mismatch problem
is the reason for the hypothetical

questions or multiple embeddings.

But exactly, the contextualization also makes it more semantically meaningful and more complete. And I think this is a problem that comes with chunking, right?

That as you said earlier in
traditional search engines, you

think about documents and a document
contains a lot of information, but

a chunk can be quite small, right?

It could be a few hundred characters
or like a hundred characters, or

it could be like a sentence or some words, and obviously on its own it doesn't necessarily mean all that much, whereas in the context of a particular document, maybe it means something.

And so it's just a way of
getting around that problem.

Nicolay Gerold: Yeah.

And especially relative sizes, because chunking tends to be used when I have really long documents, whereas in traditional search engines the majority of applications are around e-commerce, where your text sizes are very small. What would you say are the main issues we face, especially around internal documentation, but also around the approaches to chunking and search at the moment?

Max Buckley: So I can only speak
from my own experience, right?

But I think that there's a huge
opportunity that sort of LLMs provide

for internal knowledge management.

And specifically, I think RAG is the key right now to unlocking that, if you're in a company where you have an extensive set of documentation, and not even strictly speaking documentation. It could be questions and answers. It could be historical transcripts or whatever, all of which can be chunked and synthesized and used to answer future questions and really expedite people's knowledge discovery process.

But what we have seen at Google, for example, is all sorts of quality problems, data quality, document quality problems, where, for example, there may be two documented ways that something works, or one. Maybe they were both correct at different points in history, right? Someone documented it in this file at one point, then a few years later, someone documented it in that file. And now we have two files that say the opposite as to what you should do, and only one of them can be correct. Like here it says, for example, do the database update in parallel, and here it says, do the database update first. Okay. One is correct. One is not correct.

And again, this is something where I think
LLMs can really help to solve the problem.

Obviously in the past, you could ask some engineer to go and update documentation, without specifying exactly what to update or where to start, and there's a lot of documentation, hundreds of thousands of files. So it's not a very satisfying task. Whereas an LLM, especially with large contexts, can read in a huge set of files, and you can ask it for errors, ambiguities, contradictions, and it can give you back a list. And many of them you can go and very quickly knock out, or you can just go and change one line of documentation and that issue is resolved.

It can also identify things like areas that are underdocumented, or areas that are mentioned but not explained. And of course, these could also be action items for people to document. And then this improvement will be reflected in the next set of answers, where you read in the newest data.
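
A sketch of that audit loop, assuming the documentation lives in markdown files; ask_llm is a placeholder for whichever large-context model call you use (Gemini, Claude, or similar):

```python
# Load a set of markdown files into one large-context prompt and ask for a
# list of errors, ambiguities and contradictions, with quotes and file names
# so each finding can be checked quickly.
from pathlib import Path

AUDIT_PROMPT = (
    "You are reviewing internal documentation.\n"
    "List up to 10 concrete errors, contradictions or ambiguities you find.\n"
    "For each one, quote the offending sentence and name the file it came from."
)

def load_docs(root: str) -> str:
    parts = []
    for path in sorted(Path(root).rglob("*.md")):
        parts.append(f"<file path='{path}'>\n{path.read_text()}\n</file>")
    return "\n\n".join(parts)

def audit_documentation(root: str, ask_llm) -> str:
    return ask_llm(f"{load_docs(root)}\n\n{AUDIT_PROMPT}")
```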

But obviously this is not a perfect process, right? Given today's LLMs, you will get some hallucinations. If you ask for, say, ten errors, contradictions, ambiguities: some of them are really useful, really punchy stuff; others are maybe borderline stuff like capitalization, typos, small potatoes; and other stuff is maybe just hallucinated or wrong, where it says this is this and it isn't. And especially if you ask it for file references, it can give you those too, so you can go and check. So you don't necessarily waste all that much time looking at them, but sometimes they're just not particularly saying anything.

So I think having high quality data is key, and different data sources maybe have a different quality bar or different expectations in terms of maintenance. So for example, formal documentation might be something that people strive to keep up to date. Whereas if you have an internal question-and-answer repo, maybe the expectation is that you answered a question, and that answer was good when you gave it, but it's not necessarily good several years later, right?

So how you deal in your system with
timeliness of data is a question, right?

You could be very clever and try to figure out that this system didn't change much, so maybe the older answers are still good.

Or maybe you just have some hard
cutoff and say, look, let's just

throw away things from a few
years ago as being out of date.
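
A minimal sketch of the hard-cutoff option; the field names (answered_at, kind) and the three-year window are assumptions about how your records might be shaped:

```python
# Drop Q&A-style answers older than a few years before they reach the index,
# while keeping formal documentation that teams actively maintain.
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=3 * 365)

def filter_stale(records, now=None):
    now = now or datetime.now(timezone.utc)
    fresh = []
    for record in records:
        if record.get("kind") == "formal_doc":          # assumed to be kept up to date
            fresh.append(record)
        elif now - record["answered_at"] <= MAX_AGE:    # hard cutoff for old answers
            fresh.append(record)
    return fresh
```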

Nicolay Gerold: But in the end, that's really about identifying the places where there is an issue and not really fixing it. And I think that's an important distinction, because for finding ambiguities, using something other than LLMs would be really challenging. You would need to come up with an entire NLP pipeline to figure out, okay, this is conflicting information. You could probably supplement it with some NLP, where you use entity recognition to identify the major entities, then basically flag them and maybe reduce the document size, pull in all documents where a certain entity appears, and see whether there are conflicts around that entity.

What are your thoughts on that? Like, how do you see this system working? You're thrown into a new company which has documentation issues. How do you approach it?

Max Buckley: What you've just described
is obviously possible, but sounds

to me like a lot of work, right?

Like you'd have to figure out a lot of
systems to orchestrate the various bits.

And with LLMs, you get
this magically, right?

If you can just read the markdown files out of a GitHub repo, you can write a Colab that reads in 100 files and sticks them in the prompt of a large context model like Gemini or Claude, and you ask it your question. And if you have context caching, even better, because then you can ask a sequence of questions over the same hundreds of markdown files. And for each one, it gives you back some kind of answer, which lets you go and either action it yourself, or maybe you file requests for other people to action them in case you're not the expert.

The LLM is really a kind of speed boost here, I think, compared to what came before, because a lot of these things are sometimes linguistically subtle. And what I mean by that is I've seen instances where, after loading a documentation base into the prompt and asking some questions, I could see that the answers I was getting were not what I expected as someone who's worked on the system for several years.

And what was interesting was that I asked the model for direct quotes as to why certain things were the case, and it gave me exact quotes, and I could find the offending string. And effectively, there was this funny kind of phenomenon where a system was built on top of other underlying systems, and one of the underlying systems had a particular limitation, and the higher-level system provides an abstraction so that you get around that limitation. And generally, when people are talking about the top-level system X, if they talk about the underlying system Y, it's in the context of X. But in this one paragraph, they were talking about just Y on its own.

And they made this statement where they said that this system doesn't support this programming language. And then the model just understood that to be a fact. And for every question about, does this system work, does X work with this programming language, it would say no, because of this one offending sentence.

And literally changing the sentence to be more specific, to say: while the underlying system does not support this programming language, system X provides an abstraction on top of it that makes them compatible. This simple fix instantly meant that all those kinds of questions went the way you would expect, right? The ambiguity was gone. But the funny thing is, as a human, if you didn't know and you just read that paragraph, you would rightly make the same assumption that the LLM had just made. If you were a beginner in the company, you would absolutely read it and go, okay, so I'll remember not to use these together. And so making it better for the LLM also made it better for the humans.

Nicolay Gerold: And how would you actually go about figuring out that you have ambiguities in the first place? Because in most companies, the documentation is way too large and you have way too many documents to just throw all of it into a long context model, or you would have to basically go into an iterative mode and iteratively pass 100,000 tokens into the LLM and ask it, okay, are there ambiguities? How would you actually approach it in a more targeted way? Maybe linking it back to the search queries, or to where you didn't have a hit or didn't give the user a satisfactory answer.

Max Buckley: So yeah, on the first part there, about what you do when it's too big to fit in a large context: that certainly is the problem. I don't think you can necessarily tackle a whole company in one go, but potentially you can do smaller, departmental or functional or system-based improvements. But if you want to go bigger than that, if you want to dynamically gather this documentation, then that's when you're talking about retrieval-augmented generation, right? When you have some sort of semantic search, and it doesn't have to be semantic, but let's just say semantic search over the chunks, and you return the chunks and you generate the answers.

If you have one of those systems, of course you can gather user feedback, right? You can allow for thumbs up, thumbs down, and potentially even something qualitative, where they can write free text about what was wrong or right.

And even just from using the system, maybe you can see that your chunking strategy is limited, right? As a developer, you can find that there are chunks that are maybe useless or meaningless, but for some reason are still being ranked highly and still being retrieved. As in, the chunk itself might have nice headers, but then just say TODO and the name of the person who's supposed to actually write the documentation. And of course, this is not something you ever really need to retrieve for any particular reason. But your model retrieves it, and then suddenly it has fewer chunks, fewer usable chunks, than it would have otherwise.

Or the one that we commonly see is with re-ranking models that are pointwise, that score each particular data point given the query individually: they often find a good chunk, but then they find pseudo-duplicates of it, given that we have a lot of redundancy in documentation, especially if you look across different teams. So the re-ranking model will often give us four of the same chunk from different teams that have all recommended the same best practice, and you're like, okay, we'd like some diversity of choices there. There are listwise re-ranking models that can get around that problem.
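
A sketch of one way to get that diversity after a pointwise re-ranker, by greedily dropping near-duplicate chunks; the embedding model and the 0.9 similarity threshold are assumptions:

```python
# Walk the re-ranked list in order and skip any chunk that is too similar to
# a chunk already selected, so four copies of the same best practice from
# different teams collapse into one.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def diversify(ranked_chunks, keep: int = 5, max_sim: float = 0.9):
    selected, selected_vecs = [], []
    for chunk in ranked_chunks:               # already ordered by re-ranker score
        vec = model.encode(chunk)
        if any(util.cos_sim(vec, v).item() > max_sim for v in selected_vecs):
            continue                          # near-duplicate of an earlier pick
        selected.append(chunk)
        selected_vecs.append(vec)
        if len(selected) == keep:
            break
    return selected
```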

Sorry, I feel like I went on a tangent and didn't fully answer your question.

Nicolay Gerold: During generation. I think we are at the moment building an agent for coding. And in that we feed in documentation, but every time it makes a generation, we cross-check: hey, is this actually consistent with the documentation? Because of the priors of the LLMs, it often makes a wrong choice in how to use a certain programming language.

So these ambiguities are useful in multiple different ways. And I think when you can ask specific questions, it's always better. So you always have to think: I'm not looking for ambiguities in general, but rather I'm looking for specific ambiguities, which should be informed by pre-existing user queries, which maybe failed, or didn't return good results, or where the generation completely failed, or where the NLI model told you, hey, this answer is not factually consistent with the input you gave the model.
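
A minimal sketch of such an NLI check, assuming an off-the-shelf NLI classifier from Hugging Face; the model choice and its label names are assumptions:

```python
# Treat the retrieved chunks as the premise and the generated answer as the
# hypothesis; flag the generation if the top NLI label is "contradiction".
from transformers import pipeline

nli = pipeline("text-classification", model="facebook/bart-large-mnli")

def is_consistent(answer: str, chunks: list) -> bool:
    premise = "\n".join(chunks)
    result = nli({"text": premise, "text_pair": answer}, truncation=True)
    top = result[0] if isinstance(result, list) else result
    return top["label"].lower() != "contradiction"
```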

Max Buckley: So I haven't really looked into these recitation checks for RAG systems, where you can annotate your data, yet; I've only started looking at them recently. The idea is that if we're giving a large number of chunks, we can say what part of the response came from, or was heavily influenced by, a particular source. Which is obviously nice, because we can show it visually and say, okay, this section came from number one, or this came from number three. But of course it also lets us see if there were some chunks that we didn't use at all, and that lets us programmatically get data as to which ones were actually used by the model itself, right? Without relying on the user thumbs-upping or thumbs-downing it. Because if we're retrieving a chunk and never using it, then it's not such a useful chunk for this set of queries.
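
A crude sketch of such a usage check based on word overlap; real citation or grounding features in model APIs do this far more reliably, and the 0.3 threshold is an assumption:

```python
# Which retrieved chunks share enough vocabulary with the generated answer to
# plausibly have been used? Chunks that never show up here across many queries
# are candidates for better chunking or removal.
import re

def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def used_chunks(answer: str, chunks, min_overlap: float = 0.3):
    answer_tokens = _tokens(answer)
    used = []
    for i, chunk in enumerate(chunks):
        chunk_tokens = _tokens(chunk)
        if chunk_tokens and len(chunk_tokens & answer_tokens) / len(chunk_tokens) >= min_overlap:
            used.append(i)
    return used
```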

Nicolay Gerold: Yeah.

And I think the most challenging part is errors. So I would love to know, how did you find this concrete error where the fact that was in the documentation was actually just plain wrong? Because with ambiguities, you have the same information stated in different ways, so those should be easier to identify, but plain errors in the facts are really difficult to find in the end.

Max Buckley: Yeah.

In that particular case, I had
the benefit of being a domain

expert in that system, right?

So when I was playing with this large
context model, I was specifically

asking questions that I thought
like a user, a beginner would ask.

And I was like looking at the answers
knowing what I wanted or expected.

And when I got something that kind of
disagreed with that, I was able to fix it.

But what I also found interesting
was just asking for errors worked.

Okay, I'm not going to say perfectly, it doesn't find all the errors necessarily, but it also worked, right? I asked for errors and it sometimes gave me these funny concrete examples. Other ones were ones that I didn't personally spot, nor did the LLM.

But one team I was working with came back to me after a few weeks, and they said that they were very happy with the LLM responses. One problem was that the model kept recommending people come to their office hours. And they don't have office hours. And I was thinking, pretty sure the documentation says they have office hours. And it did, four times. In four different parts of the documentation it said, if this doesn't solve your problem, come to our office hours. So they were saying it was a hallucination. We quickly removed those mentions, and it hasn't happened since.

Nicolay Gerold: Go ahead.

Max Buckley: Yeah. I guess the line of what counts as an error is difficult to quantify, right? Even if it's independently verifiable: if the documentation says two things that contradict each other, one place says there are two options and somewhere else says there are three options, then only one can be correct. But in other cases it's hard. The model couldn't have known that there are no office hours, because it wasn't even written anywhere that this team does not have office hours. If that had been written down, it would obviously have been something more explicit and fixable, but as it was, it required a user to point it out.

Nicolay Gerold: Yeah, I think it only shows that with search systems, you can never look at them in isolation. In the end, you need a complete team to actually fix it, subject matter experts, but also user feedback to actually point out the wrong information.

What I'm really curious about is to get your thoughts on misgenerations. So when the model always adds information or includes something, and that's a bias from its training data. One concrete example is when you ask models to write something for social media, it always falls into a certain style. And I think this is also an interesting direction: the documents you feed into the context could bias the model in a certain direction. I would love to hear your thoughts on how you would actually change the documents you have to also change the behavior of the generation.

Max Buckley: I know what you mean, in the sense that if we're putting Markdown in as the documentation, the model will just start spouting Markdown in its response, which is actually quite useful; we don't have to say respond in Markdown or whatever, it will just do it.

Generally, my experience so far with using Gemini for generation has been pretty good in terms of it not answering when something is explicitly outside of the context. So if we ask it some question like, how many legs does a horse have, and the retrieval from the internal documentation and questions doesn't have any mention of the anatomy of a horse, it will usually say something like, oh, this question is not answerable given this context. So I haven't seen it do much in-domain hallucination, but I don't have an exact number.

In terms of the actual style that it picks up, it's difficult to say. I've definitely seen it hallucinate, especially in longer responses, but again, not hugely badly. I would say RAG in general is good for reducing hallucinations versus just prompting the model and asking the question. So that definitely helps.

As I said earlier, when asking for contradictions or ambiguities, probably if you ask for ten, two are misunderstood, misinterpreted, whatever else. So there are some, but not a huge amount. What I was doing with some of this was generating a document with the model's response, here are ten ambiguities in your documentation, and sending it to some other team and saying, hey, take a look at these ten things. And people would come back to me with: fixed; not a real problem; or maybe this one isn't really ours because it's owned by another team, so we're not going to do it; and then fixed, fixed; and then maybe a hallucination, and okay, I'll take it.

But in terms of the model's priors, it's hard to say; I haven't done a more detailed analysis there. One thing is the case, though: in a company that's been around for a long time, a larger company, you're going to have a lot of internal evolution of language. A company that has hundreds of thousands of employees and has been around for 25-plus years is going to have its own internal vocabulary, its own way of naming things.

And what's interesting is, if you use a model, let's say for example Gemini, a priori it doesn't understand the internal language of a company like Google, it doesn't understand how we use certain words. But given the RAG or large context, it picks them up on the fly, which is quite impressive. The in-context learning there is nice. Because we are often asking questions that use a bunch of terms that are, I wouldn't say meaningless, but that don't have the same meaning inside and outside the company. And in some cases we're using an externally available embedding model, right? So that embedding model doesn't correctly embed the relationships between some of the words, because it's unknown vocabulary to it. But that is a limitation.

Nicolay Gerold: Yeah.

Do you think it would be easier to run this only on the retrieved context? So basically, on every query the user makes, launch a background process which just asks, hey, are there ambiguities considering this query and this context, feeding in the 10, 20, 30 chunks we just retrieved? And just run this continuously, because you have the contextualization of the user query as well.

Max Buckley: You certainly could, and write it out somewhere in the background. That sounds like a great idea. Maybe I'll do that. Thank you. Absolutely.

What's interesting for us at least is, I was giving the example of terminology. Terminology is difficult as you go broader, across the whole company, because the names of some of these systems could be used in different ways by different teams. Being consistent is easier within a particular team's code base than across the whole company. So just looking at our own use of a word and trying to align that is a lot easier than trying to align across teams or across the whole world, the globe as it were.

But yes, you're right. Especially when we retrieve across the entire company, we're going to get lots of different chunks, and maybe they'll say different things or whatever else. It would be useful for us to keep track of these errors that happen in the background and then actually fix them. I hadn't thought of that. So yeah.

Nicolay Gerold: Have fun implementing it, a great weekend project.

How do you actually think about definitions? Because I imagine you have quite a lot of terms or lingo you use internally for different systems. So how do you think about the use of definitions, and when to actually add the definitions into the context? Or do you always just run the RAG and the definitions are part of the chunks as well?

Max Buckley: So this is actually one that we have more detailed thoughts on. That was something some other people on my team found in earlier LLM efforts, right? That the LLMs were just not aware of Google-internal semantics and vocabulary. And so they actually found that adding a quick glossary of terms to the beginning of the prompt, or I suppose in the system prompt, really helped in terms of the tasks they were trying to solve. When we moved to large context approaches, a lot of these could be picked up almost implicitly by the model, right? Given dozens or hundreds of files, it sees all these terms being used in many different contexts and seems to be able to intuit what they actually mean. Or some of them will be defined at some point, and other ones would be maybe just implicit.

Though one funny failure mode that we saw with the large context approach was that, as you said, many systems have a lot of aliases, right? Different people refer to them as different things. They may use the code name that was used when it was being developed. Maybe they use the external cloud name for the thing, as opposed to the internal name. There are a lot of different aliases for the same thing.

And when we were looking at some of the question-and-answer system data, we noticed that there was a particular subset of queries that were using a particular name for a system and not retrieving the relevant content. Effectively, they were using the name of the backend of the system instead of what most people refer to it as. And so the fix in this case was literally adding one sentence to the documentation that said X is the backend for Y, and every query we've had since that asked about X found the right thing.

But until that point, it just didn't. If we had had the entire documentation of the whole company, we potentially would have gotten around that problem, or semantic or lexical search potentially would have, but this was when we were just using large context and the documentation for one particular system. So that one sentence fixed that subset of user questions, which was pretty cool.

But yeah, as I say, in general right now a lot of what we're looking at is semantic search, so we're embedding the query, and it does seem to find the matching terms. Even though, in my current project, we're embedding with an externally available embedding model that doesn't necessarily understand the full semantics of an internal term, it still finds the things that are similar and finds the correct thing.

Nicolay Gerold: Yeah.

And on a previous podcast with Nils Reimers, we talked about how knowledge editing in embeddings is actually way easier than in language models. So the source of failure is probably rather the language model, because it will, from its prior knowledge, mistake the definition or the term for something else. Whereas with the embedding model, if you fine-tune it, you can basically ingrain the term as a different entity, based on the similarities you give it through the continued training.

Max Buckley: And we've had good success here, right? Some system names, for example, are just a normal external word. So if you ask about guitar, for example, a system at Google but also a real musical instrument, depending on the approach we've tried: if you do RAG, of course, you return a bunch of documentation about how you set up the system and very little to nothing about how you play the musical instrument. So the LLM does a good job of answering the question about the system and not the musical instrument. Probably if you asked it about the musical instrument, you'd have a different challenge. It would probably understand what you're talking about, and it would state something like I said earlier, that there's no information here about playing the musical instrument.

Nicolay Gerold: Yeah, I would actually love to know your thoughts on context misinterpretation. So when I have, for example, the sentence "product A was replaced by product B in 2020, and product B costs 100", and then I have a user query against this context like "how much does product A cost?". That information just isn't available, but even if it were available, it could be screwed up by the LLM. So have you spent any thought or effort on actually minimizing context which could be misinterpreted, by a reader or by an LLM?

Max Buckley: So yes. I did try this myself for a particular system, more as a kind of proof of concept. Where I found these kinds of chunks that were ambiguous or unclear, or where the model was citing something, I found that it was quite fruitful to literally fix those simple sentences where it could be done. Now, of course, that doesn't scale to a whole company.

But one thing I would like to do here, and I haven't done this yet, is to potentially use these things at pre-submit time. Imagine that someone is making a change, and this can obviously be done at Google or externally as well: if someone is changing a markdown file, modifying or updating the documentation in a company, it would be great to have an LLM check that change.

At the simplest, it could be checking just for readability, in English or the language of your choice, but also how clear is this, right? Because sometimes people start a sentence with "it" and don't say what "it" is, maybe it's a new paragraph, and of course it's better to say what it is, especially if the text gets chunked and we can't rely on the header above it or whatever else. But also just: does this section make sense in the context of this file, or even in the context of the whole directory? And this way you could prevent some of these issues from going in in the first place.

So you could improve the quality of things that are getting submitted, which in turn would improve the quality of things that are retrieved. And I also think that if someone is modifying the documentation, they have it loaded into their personal context, right? They're perfectly positioned to read an automatically generated comment on their change that says: prefer this, or maybe try rewording it this way, or break this sentence into two sentences. It's simple stuff, but it makes the thing more readable to a human and also to an LLM.
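
A sketch of what such a pre-submit check could look like; ask_llm is a placeholder for your model call, and hooking the output into an actual review tool is left out:

```python
# Lint a documentation change before it lands: dangling "it"/"this" references,
# sentences that would be unclear when read as an isolated chunk, and
# contradictions with the rest of the file.
LINT_PROMPT = (
    "You are reviewing a documentation change. As short review comments, point out:\n"
    "- sentences starting with 'it' or 'this' without saying what they refer to,\n"
    "- sentences that would be unclear if read in isolation (e.g. after chunking),\n"
    "- anything that contradicts the rest of the file.\n"
    "Suggest a concrete rewording for each issue."
)

def lint_doc_change(old_text: str, new_text: str, full_file: str, ask_llm) -> str:
    prompt = (
        f"{LINT_PROMPT}\n\n<file>\n{full_file}\n</file>\n\n"
        f"<before>\n{old_text}\n</before>\n<after>\n{new_text}\n</after>"
    )
    return ask_llm(prompt)
```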

Nicolay Gerold: Yeah. And it might even be interesting, and this is just more of a crazy thought: because it's another document, the similarity should work very well against the already existing documentation. You could just do naive chunking on the entire existing documentation and run a retrieval on it, feed everything above a threshold into a long context LLM again, and see whether there are any contradictions in there with the piece of documentation you're currently writing, and then you just highlight that to the writer.

Max Buckley: Yeah, you certainly could. That's a good point. Given what they've written, you could look for and find the similar things and tell them it's similar to this or that. In theory they can reuse those things, they can embed or include that other documentation in their documentation rather than having to duplicate it. They can potentially use it to sanity check what they're writing, especially if they're not the team that owns the system, right? If they're just a sort of third-party team that's writing a kind of summary of what the system is for their own team, they do want to be correct.

But I really like your idea here: even just, for example, if we have the RAG system set up, setting K to a large number like a hundred, asking a question, and then with those hundred things in the context, you could ask for contradictions, errors, ambiguities, and you would effectively get that across the entire code base for a particular query.
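
A short sketch of that large-K audit, assuming retrieve and ask_llm are the retrieval and generation calls you already have in your RAG setup:

```python
# Retrieve far more chunks than you would for answering, then ask one
# large-context question about the whole set for this query.
def audit_for_query(query: str, retrieve, ask_llm, k: int = 100) -> str:
    chunks = retrieve(query, top_k=k)
    numbered = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks))
    return ask_llm(
        f"Documentation chunks retrieved for the query '{query}':\n\n{numbered}\n\n"
        "List any contradictions, errors or ambiguities between these chunks, "
        "citing chunk numbers."
    )
```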

Nicolay Gerold: Yeah, I think we talked about a lot. I just want to have a list of stuff to look out for, which is: errors, ambiguities, contradictions, out-of-date information, inconsistent information, lost relationships, missing definitions. What else is out there, in your opinion, that people should look out for in their documentation when they're doing RAG?

Max Buckley: I'd say the contextual retrieval, contextualizing the chunks, right? The most naive approach to chunking is to do, say, hundred-character sequences or something, and this is obviously bad for many reasons, because you're potentially splitting words. There are obviously more natural approaches where you split on new lines or on paragraphs. But of course, not all documents are well behaved. You'll find, when you do your clever chunking strategy beyond a hundred characters, that some people have put everything in one sentence or everything in one giant wall of text. So you do need some kind of fallback heuristics.

But even when you do these things, you need to ask: are these chunks meaningful, or what extra information can we capture with the chunks? And if you're using any kind of structured data format, then you can potentially have the chunk plus maybe the headers that are above it and the file reference, and maybe you can have additional data generated by LLMs, right? You could have a summary of the document, or you could have, like I said, contextualized embeddings or questions, all of these things. But of course, all of these things rely again on the quality of the source data.
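
A sketch of structure-aware chunking with a fallback for unstructured walls of text; the size limits are assumptions to tune for your corpus:

```python
# Split markdown on blank lines, carry the heading path (H1 > H2 > H3) plus the
# file path as a context prefix, and fall back to fixed-size windows when a
# block has no usable structure.
import re

def chunk_markdown(path: str, text: str, max_chars: int = 1500, window: int = 800):
    chunks, heading_path = [], []
    for block in re.split(r"\n{2,}", text):
        lines = block.strip().splitlines()
        if not lines:
            continue
        heading = re.match(r"^(#{1,6})\s+(.*)", lines[0])
        if heading:                                       # update the heading hierarchy
            level = len(heading.group(1))
            heading_path = heading_path[: level - 1] + [heading.group(2).strip()]
            lines = lines[1:]
            if not lines:
                continue
        body = "\n".join(lines)
        prefix = " > ".join([path, *heading_path])        # file path + heading context
        if len(body) <= max_chars:
            chunks.append(f"{prefix}\n{body}")
        else:                                             # fallback for walls of text
            for start in range(0, len(body), window):
                chunks.append(f"{prefix}\n{body[start:start + window]}")
    return chunks
```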

Nicolay Gerold: Yeah, and I think there are two different types of context loss: basically position, but also information which is just missing. So if I have a "this" without a reference to what "this" is, those are, I think, the two major types. Especially in Anthropic's case, they rather address the second one, the missing information, and not the position. For the position one, you can go in a lot of directions depending on the type of information you have. I've seen HtmlRAG, I'm not sure whether you've seen that paper, where they're basically trying to do something similar for HTML, as you mentioned for Markdown, but they only take semantically meaningful elements of the HTML tree, which is also really interesting. Nice.

And I think the LLM space is such a hyped area at the moment. I always like to ask people about underappreciated and overhyped. What do you think? In the data and AI space, what's an underappreciated technology and what's an overhyped technology?

Max Buckley: I mean, I feel LLMs are clearly not underappreciated or underhyped. I fear we're at the cusp of an AI winter, in the sense of, not that the things aren't great, but just that with the scaling slowing down, some of the momentum will be lost. I think that's not a problem in the sense that I think RAG is hugely valuable. I think that for a lot of companies, for a lot of use cases, this can solve the problem in a way better way than we had before. Like a lot of customer service type use cases; this is amazingly useful. You could have a small team of elite customer service people supporting a huge army of bots. And again, every correct answer that they give becomes like gold for future iterations.

And that's what is happening with the documentation here. If you can document it correctly once, then it can be found and synthesized for hundreds of user questions going forward. And that's super valuable to me, because I don't want to answer the same question twice.

I think right now we're obviously very focused on LLMs. Traditional supervised learning and all that is still there, not getting as much hype, but I think it's still useful. My feeling is that things like AlphaFold will be more significant historically than LLMs, but that's just my intuition. I wouldn't say it's not hyped; it's obviously getting its own traction, but it's not as accessible to normal people as ChatGPT is.

Nicolay Gerold: I think in the future we will see more and more AI systems where you actually combine multiple types of AI to get a good output. And I think RAG is already an example of that. You have a retrieval system and the LLM for generation. And usually your retrieval consists of multiple components as well, which often include more of the traditional AI in the form of classifiers.

Max Buckley: One thing I think will happen out of this huge hardware buildup for LLMs, given that LLMs are seemingly plateauing in terms of scaling according to some of the news I've been reading, is that all of this hardware build-out will make it much easier and cheaper to do all sorts of data science going forward. You won't have as much of a struggle to get access to GPUs if someone went and built millions of them for LLMs.

All sorts of other supervised learning and unsupervised learning use cases can potentially be done. I think it would be a lot easier to found a deep tech or a deep RL startup today than it would have been several years ago, especially if my suspicion about the AI winter comes true.

Nicolay Gerold: Yeah.

And also with the trend that they are building specialized hardware for LLMs: if LLMs are here to stay, they will probably move on to specialized hardware, and then you have an abundance of GPUs for other types of models.

Max Buckley: One thing I really liked watching the other episodes of your podcast: what's interesting is that search is not a new space. But I feel like with RAG it feels new to a lot of people, or it's like a discovery to people. There are all these fields that are partially overlapping, and they're converging in this kind of fun way.

So it's very interesting to see, even within Google, how different people have approached RAG very differently. Some people are really just focusing on the query, actual prompt optimization, trying to get the prompt right to make the model work a certain way. Other people are thinking much more deeply about the actual data, like myself, and maybe some more people focus on the infrastructure. And it's interesting because people are coming at it from different perspectives with different kinds of priors. So I think there's going to be even more fusion of these communities over the years to come. I think we're democratizing some of search. I think RAG-type use cases make search useful in all sorts of domains, and with multimodality there may be even more domains.

Nicolay Gerold: The thing is, search is having a renaissance, but we first basically had to rediscover a lot of this stuff which was already figured out. And I think that's how it always goes: there is a new technology hype cycle, you combine things, you figure out new ways of using them, and then you're building on old technologies without really knowing or thinking about the old technology, which would probably save you most of the work.

Max Buckley: Yeah, it's fascinating. For me at least, I took some Coursera courses back in maybe 2014 on how search engines work, the kind of old TF-IDF, BM25 things, and built some simple things. And I hadn't really thought about search since. I obviously use search, but I hadn't really thought about it or built search systems until around the end of '22, beginning of '23. So it was interesting to go back and see, okay, there's been a lot of progress.

And again, some of these things work really nicely now, and they're so easy to use; it's so easy to prototype with some of these LLM applications. One thing I love with large context is that you have some documentation of your company, you can just slap that in there, ask a question, and see if it can answer it. You're like, all right, it works; that's business value in a couple of hours.

Nicolay Gerold: Yeah, and what is something you're excited about, or what is something you would love to see built in your field or around it?

Max Buckley: I'm very excited about all things AI. I think it's a very exciting field. I have this lucky coincidence that I got into ML at the end of 2012, when I just discovered the term machine learning. So I've been chasing this for a while, and then it suddenly became the coolest thing ever in '22. So it's been very interesting to watch; while I'm new to search, I'm not new to machine learning. It's really interesting to see how this whole thing plays out. I wouldn't have thought that we would have had ChatGPT publicly at the end of '22, and I was taken aback by how good it was, and I have since been just impressed by the progress. I'm happy to see where it goes next.

Riding in those self-driving Waymos is also pretty cool, I can attest to that fact.

Nicolay Gerold: Are they in Zurich? No, right?

Max Buckley: No. In San Francisco.

Nicolay Gerold: I think in Europe
you will have to wait a few

decades before they come here.

Max Buckley: Yeah, I'm a little bit worried about Europe's approach to AI, to be honest. It feels like when you're in last place in a race, it's not a great idea to tie your shoelaces together.

Nicolay Gerold: Yeah, US
innovates, we regulate.

Yeah, that's the fun stuff.

Nice.

And if people want to follow along
with you, where can they do that?

Max Buckley: LinkedIn is
probably the best place.

I post stuff reasonably regularly.

Okay.

Nicolay Gerold: Yeah.

And I will put that in the show notes. Max is a really good follow, especially when you want to stay up to date on a lot of research in the retrieval space. I think I've been abusing Max's posts a lot to avoid the work of actually finding the interesting papers myself.

So what can we take away? I think the key idea is to make your documentation ready for AI. And I think this is a theme we will see more and more, across the different types of documents and artifacts we are working with, for example also in code. A lot of work will be spent on making code bases, documentation, and everything else ready for AI, so that we can actually use LLMs optimally, and we can send agents in there to fix bugs or to write new features at some point in the future.

And I think, first of all, we have to fix up what they will in the end be working with, because we might have inconsistent documentation, some wrong facts in there, or ambiguities, all of which have to be removed, so the AI doesn't make mistakes that are in the end caused by factual errors.

And for search specifically, I think it's very valuable to make the different document quality issues you might encounter explicit, and just to be on the lookout for them and to do spot checks. Or just take a large portion where you might expect failure cases, or which is a particularly important part of your documentation, feed it into a long-context-window LLM like Gemini, and just ask it whether it can find any inconsistencies, any ambiguities, or whatever.

I think in general, the more sensible approach would be to actually do that on the fly. So when you're running a RAG system, you're probably already re-ranking anyway, so you can do one additional model call without immense cost. It depends on your load in the end. If your load is too high and you don't really want to spend the model cost of feeding all the chunks into an extra LLM and asking, hey, are there any ambiguities, then you could do spot checks, say on 10% of the generations.

Just take all the chunks, throw them into a long-context LLM, and ask it whether there are any ambiguities, any factual errors. And when it classifies some as factually inconsistent or whatever, just throw an error. Maybe re-retrieve with something else, for example fall back to BM25. And actually throw the error or a warning, so your search people are notified and can look at the specific chunks and correct them themselves or with an expert.
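
A minimal sketch of that spot-check loop; retrieve, generate, check_chunks, and bm25_retrieve are placeholders for your own pipeline components:

```python
# On a sample of generations, ask a long-context model whether the retrieved
# chunks are ambiguous or inconsistent, warn the search team, and fall back to
# lexical retrieval for this answer.
import logging
import random

log = logging.getLogger("rag.quality")

def answer_with_spot_check(query, retrieve, generate, check_chunks, bm25_retrieve,
                           sample_rate: float = 0.1):
    chunks = retrieve(query)
    answer = generate(query, chunks)
    if random.random() < sample_rate:                  # only pay for the extra call sometimes
        verdict = check_chunks(query, chunks, answer)  # e.g. "any ambiguities or errors here?"
        if verdict.get("inconsistent"):
            log.warning("Inconsistent chunks for %r: %s", query, verdict.get("details"))
            answer = generate(query, bm25_retrieve(query))   # retry with lexical retrieval
    return answer
```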

And I think this already is a good feedback loop to be consistently improving the materials you're working with. And I think this can get you away from having a lot of silent killers in your search database. Because a single ambiguous sentence might corrupt an entire set of responses, and fixing those often isn't that much work, but identifying them is really challenging. And especially if you have a really large search database, finding all of this up front is completely infeasible, so you have to do it consistently, over time, continuously eradicating the mistakes.

We will be continuing our series on RAG next week.

I would love to know what you think of this episode; let me know in the comments.

Also like and subscribe.

I will talk to you soon.

Catch you next week.
