Chunking for RAG: Stop Breaking Your Documents Into Meaningless Pieces | S2 E20

Nicolay Gerold: The biggest lie in
RAG is that semantic search is simple.

The reality is that it's easy to
build, it's easy to get up and running,

but it's really hard to get right.

And if you don't have a good setup,
it's near impossible to debug.

One of the reasons it's really
hard is actually chunking.

And there are a lot of
things you can get wrong.

And even OpenAI botched it a little bit, in my opinion, using an 800 token length for the chunks.

And this might work for legal, where
you have a lot of boilerplate that

carries little semantic meaning,
but often you have the opposite.

You have very information dense content, and imagine fitting an entire Wikipedia page into the size of a tweet. There will be a lot of information that's actually lost, and that's what happens with long chunks.

The next is overlap. OpenAI uses a 400 token overlap, or used to. And what this does is we try to bring the important context into the chunk, but in reality, we don't really know where the context is coming from.

It could be from a few pages prior,
not just the 400 tokens before.

It could also be from a definition
that's not even in the document at all.

There is a really interesting solution from Anthropic, Contextual Retrieval, where you basically preprocess all the chunks to see whether there is any missing information and try to reintroduce it.

So today we are back continuing
our series on search.

We are talking to Brandon Smith,
who is a research engineer at

Chroma, the vector database.

And he actually led one of the largest studies in the field on different chunking techniques.

So today we will look at how we can unfuck our RAG systems from badly chosen chunking hyperparameters.

Let's do it!

So what do you think is the motivation for chunking? Why do we actually need it?

Brandon Smith: That's a good question.

To be honest with you, I do genuinely believe chunking is something we do because we have embedding models.

And in a better world, we wouldn't
actually chunk at all, right?

In a better world, we would extract
the tokens that matter for a given

query and bring them back perfectly.

But that's not what happened.

It turns out we can make some very powerful embedding models that are very strong.

But they obviously need an input string. You have to give them an input string.

And they need context as well, actually.

I guess, like fundamentally, you have a query, and you have some passages in the text that you want to retrieve for that given query.

And of course, the way we know to do this, the simple way, is that you embed the query, and then you embed a bunch of chunks of the documents, and then you retrieve the right chunk for the given query.

Ideally, a better system in theory would actually take the query and really read along the entire corpus. Like if you gave it to a person, they'd take the query and read along the entire corpus with that query in hand until they see something that matches, and they'd go, great, this is good.

These two match.

But what we've actually said is when we
like embed our corpus, we just do it once.

It doesn't matter how the query
changes, at least in general, in

standard RAG you just do it once.

And we do that because
it's really efficient.

We embed the entire corpus
and then every query, they get

their own separate embedding.

And we just do a little outer
product between them whenever.

And really, fundamentally,
we chunk because it's just

efficient and it's quick, right?

There is a better way, though.

And as I said, this is, I believe, how ColBERT works. My boss Anton also did some papers on that. The idea is that you don't just chunk and embed separately; the process is done together. You hand over the query and a chunk, and the two then compute the similarity together. You can't just have their vectors held fixed in place.

But yeah, to summarize the
question, why do we chunk?

I'd say it's because
it's efficient and quick.

That's why we chunk right now.

Nicolay Gerold: Yeah, it basically avoids re-embedding the entire corpus we have, which can be like millions of documents, at retrieval time.

Brandon Smith: Absolutely.

Exactly.

And yeah, and it adds latency, right?

We want this to be quick.

If you're using a chatbot in particular, you don't want the user waiting while it re-embeds everything to get something back.

So yeah.

Nicolay Gerold: Yeah, and we basically trade off accuracy for the speed of retrieval, but we can basically add it back in by using rerankers, basically doing the stuff you said just after you retrieve a candidate set.

And what are the main issues with current methods? I think we have seen so many different chunkers, especially in LangChain, like the Markdown one, HTML splitters, but I think most people use the standard one just based on length.

Brandon Smith: Yeah.

A hundred percent.

Exactly.

I think the issue with chunking before our paper was that no one knew. No one knew what was better.

There's actually a lot to be said.

There's a whole thing that I'm
going to make some posts about,

which I think is really interesting.

No one knew what was better.

Maybe there's a bit of a bold claim to
make, but I believe OpenAI themselves

also didn't know what was better
because there's a chunking strategy

they use that wasn't very good.

And we'll get into that.

But yeah, no one knew what
chunking method is better because

there's no way to compare it.

There's no metric.

There's not really a metric.

Of course you could like do the
standard RAG metrics and test the

entire pipeline, but you're not
really isolating the chunker, right?

You ideally like to isolate
the chunker and say, okay let's

just compare the chunkers.

And so that was the current issue
is that yes, as you say, there's

no way to see which is better.

Fundamentally, one could also ask, okay, what would we imagine a good chunking strategy would be?

And it's okay what is it
at the end of the day?

There are two things that matter.

Really, all that matters is that
for a given query, we retrieve

exactly the relevant information.

That's true.

That's actually it for any given query.

You want to retrieve only the tokens that relate to that exact query.

That would be the ideal RAG pipeline.

In practice, we don't have that.

Extra stuff comes along.

So you say, okay, fine, extra stuff could come along, but we want as little extra stuff as possible to come along.

And a chunker that means tons of extra stuff comes along is a very bad chunker, right?

So you can say the most trivial
chunking algorithm will be one that

doesn't chunk at all and literally
hands the entire corpus over.

That's, in theory, a possible solution. It's horrible, though, because you get tons and tons of extra stuff in the chunks.

Of course, then we have what became really popular, the fixed token splitter and the recursive token splitter.
And they seem very sensible at first.

I think the fixed token splitter
is the most natural next step.

It's, okay, let's just split it into fixed lengths, and then those are the chunks we embed.

And yeah, it's okay.

But that could break right through a word.

It's not ideal.

And then I could go into this.

Okay, I'll go into this.

I'll hop into this, right?

Basically, okay, now we're in on the idea that we want to break the text into pieces, right? We have now decided that we're going to make this system. We will actually embed everything one time, right?

Because we mentioned at the start that the ideal system wouldn't actually do that. It would do a bunch of other stuff, and it would be specific to the query. But to keep the latency low, we're just going to embed everything once.

And to be honest, our paper mentions that the focus really is that: given that we want to just embed everything once, what is the best chunker?

Because people have mentioned
actually, if you re embed the stuff,

you can maybe do something better.

That's not really the focus.

Given you only embed everything once what
is the best chunker you could do there?

Fundamentally obviously, we
wanted to retrieve only the

specific tokens for a given query.

But, it's a practical world.

It's a real world.

The embedding model has to have enough
context inside that chunk to be able

to bring it back to the query, right?

For example, let's say, as an example, there's a passage and it's about some guy, John, and it says, John buys a red Ferrari. Full stop. It is the fastest in town. Full stop.

And somebody queried and was like
how, like, how fast is John's car?

Let's say our chunking method just literally broke off "it is the fastest in town" and made that its own individual chunk. That would actually never get retrieved, because it's got nothing to do with a car.

It just said, it's the fastest in town.

You've actually completely
lost all information by just

breaking that into its own chunk.

You just completely made it useless.

You've just deleted it.

So even though that's exactly the right answer, and a person could see it and would just pull that thing saying it's the fastest in town, the embedding model has no hope of ever retrieving it, because we've broken the information, we've removed the flow of things.

So that's that's when it comes to
breaking things down like that.

Fundamentally, I guess my point
with that is that you have to keep

the context within the chunks.

The chunks have to contain context.

That's what matters.

And yes, with all these different chunking methods, ideally you have something that has context within it.

I'm going on a bit of a tangent
here and I feel like I'm covering

a bunch of other questions.

So feel free to break in. Maybe we should reorient ourselves before I go further off down that pipeline.

Nicolay Gerold: And no, I like it.

One thought I would love to get your thoughts on was my idea of MECE chunks, mutually exclusive but collectively exhaustive, which is something I try to get to. But in the end, what I landed on is one of the chunkers which you also basically implemented in the paper, which is more of the semantically oriented chunker, where you try to make the chunks as semantically coherent as possible.

Can you maybe start with these?

What can it do?

How does it work?

Brandon Smith: Yes.

A hundred percent.

Yeah.

I thought a lot about that.

Yes.

Because it really comes back to that initial thought experiment, the one with the guy who had the fast car, it's the fastest in town, whatever, right?

You could look at the chunk, it was the fastest in town, if that was just the entire chunk.

And immediately you can say
this is missing context.

There's clearly missing context.

There's a pronoun here.

I don't even know what
word it's about, right?

You could almost look at the chunk itself
and see if it's a bad chunk by being

like this is just incoherent, right?

So we can, you can
criticize the chunks itself.

Ideally, what you want is a chunker that produces chunks where every chunk is super coherent and completely self contained.

And that's what we did with the cluster semantic chunker. And mind you, I won't lie, that's one of those chunkers where, I don't know if you saw the paper, there's a picture that I think makes that chunker so clear and so obvious. And that's actually how I derived the thing in the first place.

What I did was I started off with a recursive token chunker.

And I just split the document by 50
tokens, which is like very short, right?

50 tokens is like what?

Slightly under 50 words.

It's like a sentence, maybe two sentences.

And I did that to, I wanted to split
by a sentence, but you never know if

there's no sentences, blah, blah, blah,
whatever weird HTML document I have.

Anyway so you might have
let's say you have a corpus.

You run the recursive chunker on this, and you have 100 chunks then, right?

And they're tiny, 100 tiny chunks.

What I then said is okay,
cool, we now have 100 chunks.

Let's embed all of these chunks, so
that we have all their embeddings,

and let's find the similarity
between all pairs of chunks.

And again, this is one of those things where, if I say it out loud, it's hard to follow along. If anyone wants to see, there's an image in the paper you can look at, for the cluster semantic chunker, you may have seen it.

And it's a big matrix, and this
staircase pattern down the middle.

And it's a symmetric matrix, so you
see these blocks down the middle.

And all these blocks, they're sat on the
diagonal, they're all on the diagonal.

And what that matrix is, it's just a matrix of the similarity between any pair, right?

Down the diagonal, that's the similarity between a thing and itself. So the very top left corner, there's a (0, 0). That's saying, what's the similarity between chunk zero and chunk zero. And then the next one, (1, 1), is the similarity between chunk one and chunk one. Obviously, those are all one. So you have ones on the diagonal.

Great.

Then your off-diagonals around it, you're basically asking a question: okay, what's the similarity between this chunk and its neighbor, or its neighbor's neighbor?

And sure enough, I literally just ran that code, not expecting much, not sure what I'd see. But when we generate that matrix, you see these blocks form down the diagonal. We did this on the wiki articles corpus, and sure enough.

You could see actual articles, and you can see they appear as a massive block, and that's a single article.

Fundamentally, it turns out that if
you break an article, and you get the

similarity between all their embeddings,
they're all very similar, and they're

all very different to different articles.

So that's basically the logic that starts the cluster semantic chunker.

It's wow, I can literally see it.

I can literally visualize the pairwise
similarity between all chunks.

And I can see things that are like
different contexts and different themes.

So, why don't we try our best to make a chunking algorithm that just tries its best to group everything together that's about the same thing, right? Try to capture these big squares and never break in between two squares, or two different topics.
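(A minimal sketch of the picture Brandon is describing, not Chroma's actual code: split the corpus into small ~50-token pieces, embed each piece, and build the pairwise cosine similarity matrix. `embed_texts` is a placeholder for whatever embedding model you use, e.g. OpenAI text-embedding-3-large.)

```python
import numpy as np

def split_into_pieces(text: str, piece_size: int = 50) -> list[str]:
    # Crude stand-in for the recursive token splitter: fixed-size word windows.
    words = text.split()
    return [" ".join(words[i:i + piece_size]) for i in range(0, len(words), piece_size)]

def similarity_matrix(pieces: list[str], embed_texts) -> np.ndarray:
    # embed_texts: list[str] -> array of shape (n_pieces, dim); assumed, not provided here.
    embeddings = np.asarray(embed_texts(pieces), dtype=float)
    # Normalize rows so the dot product becomes cosine similarity.
    embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings @ embeddings.T

# Topically coherent stretches of the document show up as bright square blocks
# along the diagonal of this matrix; the chunker's job is to find those blocks.
```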

Nicolay Gerold: Yeah, there you have the added assumption that you actually can make in most cases when it's human written: that there is some inherent structure in the document.

And with the semantic chunker, if you do it still basically in the flow of the document, you expect that there are basically blocks of topics.

Brandon Smith: Yes, exactly.

Exactly.

Yeah.

And yeah, exactly.

And I guess it's true.

It's very much made for chunking long corpora that someone's written.

And it did a good job.

It did a pretty good job.

Yes.

I was really happy with that.

And as well, and I was pretty pleased with this, I used a dynamic programming algorithm to find the optimal chunking.

That's pretty neat.

A bit of LeetCode hard challenge.

Yeah, it was good.

It was good.

Nicolay Gerold: What were the
different parameters you were throwing

in the optimization algorithm?

So what were like the degrees
of freedom in the end?

Brandon Smith: Yeah,
you're completely right.

Me and Anton originally agreed that we wanted as few hyperparameters as possible, right?

Because no one tunes hyperparameters, no one wants to touch them, right?

So the less there are, the better
the algorithm will work because

no one's going to tune it, right?

And so I really tried, and this
is a bit stupid looking back, but

I really tried to make something
that had no hyperparameters.

But if you think about it logically.

There actually will always be, there has to be, at least one parameter.

Because with this idea about trying to keep semantically similar chunks together, the optimum strategy otherwise is to just make every single sentence its own chunk, right?

Because obviously the most
semantically similar thing is itself.

So you do actually have to
balance it a little bit.

The way I did it, and I think someone could improve this algorithm, is just how I did it, and it seemed to work quite well.

First, obviously, I normalized the diagonal values. The diagonal was all ones, because obviously the similarity between a chunk and itself is one. I normalized that to instead just be the average similarity between all pairs, right?

So that suddenly makes the diagonal not that strong.

And I made an algorithm that said, okay, basically, you're allowed to group things into chunks, but you can only chunk consecutive things.

And this is another thing, I said it can
only chunk consecutive things, right?

You couldn't put one sentence at the
start and a sentence at the very end

of the document in the same chunk.

That would make no sense.

They're, like, completely opposite.

They're not consecutive, right?

So I said it can only chunk consecutive things.

That means that effectively, you're
basically putting squares along

the diagonal that always connect.

Again, it's very easy.

When you see the image of the matrix,
it's quite easy to visualize this.

It's basically you just have
squares groups along the diagonal.

And you're basically saying how big can
I make these squares, how much, how small

can I make them, along this diagonal.

So that's, that fixes another thing.

And then you have how big can a chunk be?

I said a chunk should be
no bigger than 400 tokens.

Sorry yeah, 400 tokens, 400 words.

And then, yeah, I just normalized the scores by basically subtracting the average score from everything as well. That just normalizes it, because the reward function for a given chunk is to have the highest sum of similarity scores.

Basically, that, that group appears as
a square on the matrix, and it takes the

sum of every matrix element inside of it.

Of course, you'd say then, the optimal strategy would be to make the square as big as possible. It would be, if there weren't any negative matrix values.

But there are negative matrix values, because we subtract the average from everything.

So if the similarity between two was
slightly less than the average, it'd

actually be negative and it'd be worse
to include it inside the chunking.

So that's the reward function.

Again, somewhat technical to say out loud.

I guess any algorithm explained
verbally is quite confusing.

You generally need some diagrams, right?
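(A rough sketch of that dynamic programming step as described above: my own reconstruction, not the paper's exact code. Subtract the average similarity so below-average pairs become negative, then pick consecutive segments, squares along the diagonal, that maximize the total within-segment similarity, subject to a cap on segment length.)

```python
import numpy as np

def cluster_chunks(sim: np.ndarray, max_pieces_per_chunk: int = 8) -> list[tuple[int, int]]:
    n = sim.shape[0]
    adjusted = sim - sim.mean()          # below-average pairs now cost reward
    np.fill_diagonal(adjusted, 0.0)      # self-similarity carries no information

    def reward(start: int, end: int) -> float:
        # Sum of adjusted similarities inside the square covering pieces [start, end).
        return float(adjusted[start:end, start:end].sum())

    best = [float("-inf")] * (n + 1)     # best[i] = best total reward for pieces [0, i)
    best[0] = 0.0
    cut = [0] * (n + 1)                  # cut[i] = where the last segment started
    for end in range(1, n + 1):
        for start in range(max(0, end - max_pieces_per_chunk), end):
            score = best[start] + reward(start, end)
            if score > best[end]:
                best[end], cut[end] = score, start

    # Walk the cut points back to recover the (start, end) segments.
    segments, end = [], n
    while end > 0:
        segments.append((cut[end], end))
        end = cut[end]
    return segments[::-1]
```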

Nicolay Gerold: Yeah.

Why did you choose the 400 words or 400 tokens as a cutoff value?

Brandon Smith: Yes, you're right. That does seem very arbitrary. It's not arbitrary, though, because we messed around a lot.

So I got a good understanding,
an intuitive understanding.

Yeah, okay.

Perhaps this is a good time to segue into OpenAI's mistake. I believe it's OpenAI's mistake, right?

From my experience, we played around with a lot of different token and recursive splitters. And we found, in general, that around 200 to 400 token size chunks do really well for recall. It's actually really good for recall with both the fixed and recursive token splitters.

And if you think about it logically, and I'm saying this because I study physics, we always think about extremes.

Like, whenever you think about a problem,
you're always thinking in extremes, right?

When it comes to, let's say, a fixed window splitter, right? The two extremes are, the window is one character. It's as tiny a window as it could be, a single character.

You would agree that its performance would be horrible, right? If it's just one character, you split the document into single characters. Because there's no context there, it completely broke the document.

The other extreme is that
it's the entire corpus.

You set that, you tune that parameter from
zero to a million or a bajillion, right?

And then that means it just grabs
the entire document and throws it in.

Again, really horrible retrieval.

I guess recall is technically good.

But very horrible performance
overall, because efficiency, blah,

blah, blah is really bad, right?

So clearly, there is like some,
there is like some sweet spot

between zero and a million.

There is going to be some maximum.

There is going to be some
ideal width of a chunk.

If you just think about it logically, if
it's going to be bad on either side, it's

going to have to be a maximum in between.

And sure enough, we found around
200 to 400, things do really well.

And my intuition for that is, the reason why 200 to 400 token chunk sizes are good. And mind you, this is for OpenAI text-embedding-3-large. This is all done with OpenAI text-embedding-3-large.

If you use a different
embedding model, I don't know.

Some people use sentence transformers.

Personally, I'm not a fan.

I think OpenAI's embedding
models are great.

But yeah, for OpenAI text-embedding-3-large, which is also what OpenAI uses for their Assistants, we find 200 to 400 token windows are the best.

And my intuition for that is that
obviously, if you make the, if you make

the window too small, you lose context.

It won't get retrieved.

Your context will be lost, right?

If you make the window too big
It dilutes the information.

Suddenly you're covering too many
topics, and if you're just trying to

get a very niche topic that's within
this massive chunk, maybe that it's

some tax policy, and there's one tiny
remark by tax policy that says, by the

way, this has to be done by this date.

But there's a million other things in
the document, and you query, hey, when

does this tax policy have to be done?

Because it's such a small part of this
huge document, it doesn't get retrieved.

And in my opinion, that's like
the sort of idea of dilution.

So anyway, we find that 200 to
400 is around that sweet spot.

And that takes us to OpenAI's, I think, the settings they've released recently, the parameters of their new Assistants.

And man, my boss made a good point. They were so lost in their own hubris. Because of their embedding models, of course, OpenAI makes amazing embedding models, and the context windows are getting bigger and bigger. The embedding models open the idea of really big context windows. You can, if you want, stick 4,000 token documents into their context window.

But it doesn't mean you should, and I think OpenAI got a bit carried away by it.

I believe, I'll pull up the stuff,
it's in the paper, that the OpenAI

systems use 800 token windows
with 400 token overlap, right?

It doesn't do so well, it doesn't do as well.

Before, they had 400 token windows and 200 token overlap, and that's way better. And in my opinion, it makes sense.

If you make your windows 800
wide, that's so much information.

Obviously your chunks will be slightly diluted.

You could have a lot of different stuff happening, and if you're trying to get a niche thing back, maybe it works if you have really bland chunks that yap on and on about a single topic, maybe.

But otherwise, if there's a lot
of stuff happening, and you're

sticking that all into this massive
800 token chunk, there's a good

chance it's just going to be missed.

The information, the retrieval for that single tiny bit, is just going to be lost.

And it seems, from our tests, that it is.

And so that was a bit of a blunder on OpenAI's part. Before, they used to have 400 token windows, and that was better; our statistics show it's better. They should change that.

Nicolay Gerold: Yeah.

Maybe an intuition basically on the chunk size, why it could work. I think there are general writing guidelines for how long a paragraph should be, and it's mostly around 100 to 250 words. And this basically comes down exactly to this token range. You also have, if you assume like 2.5 tokens per word, this comes down exactly to it.

And if you treat like a paragraph as a
unit of thought, which is pretty coherent,

you actually can come down to that.
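(A quick way to sanity-check this words-to-tokens intuition on your own corpus, using tiktoken's cl100k_base encoding, which is the one used by text-embedding-3-large. The exact ratio depends heavily on the text, so measure rather than assume; the snippet is illustrative, not from the study.)

```python
import tiktoken

def tokens_per_word(text: str) -> float:
    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode(text))
    n_words = max(len(text.split()), 1)
    return n_tokens / n_words

# Multiply a typical 100-250 word paragraph by the measured ratio to see where
# it lands relative to the 200-400 token sweet spot discussed above.
```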

Brandon Smith: Yeah, absolutely. Yeah, actually, that makes tons of sense. You can see it matches so well, and yeah, I think I blundered on that.

Nicolay Gerold: Why the large overlap? So can you maybe explain what the overlap is for in general? Like, why do I actually integrate the overlap into the different chunks?

Brandon Smith: Yes, yeah, sure, of course. Yeah, overlap. Now, of course, me and Anton, in this project, we asked the question, okay, what is overlap? Why do we have overlap, right?

The reason why we have overlap is that these chunking algorithms can't see what's inside the chunks. Overlap only exists for the recursive token and the fixed token splitters.

And the reason why, let's take the example
of the fixed token splitter, it's quite

easy to say, is that let's say you have,
let's say you have a chunk size of four,

like I don't know, like five words, right?

And you have a sentence; you can imagine that it's going to break that sentence. The chunk might end midway through a sentence.

And the reason why we do chunk overlap is that if we break something, the next chunk starts a little bit before, and so that whole sentence, the one that got broken in the previous chunk, the next chunk covers it and picks it up because it starts a bit earlier.

So it's the idea that because one chunk
might end early, the next chunk starts

early to catch whatever was broken or not.

That's why chunk overlap exists.

So fundamentally, it exists because the chunker has no idea what's inside the chunks. It can't tell. The chunker can't tell if it broke a word in half or if it broke a sentence in half.

So we use chunk overlap, as far as I'm concerned, as a way to basically deal with the fact that, oops, we may have just broken something in half.

Let's start early in the next chunk.

So it's actually, when you think
about it, quite a hacky solution.

And so, for me and Anton in our project, the ideal chunker should have no chunk overlap at all.

To an extent.

There really shouldn't be
any chunk overlap, right?

I say to an extent, because sometimes you could want all sorts of stuff. There was this one case: let's say someone has a paper and they had multiple experiments, experiment A, experiment B, experiment C, experiment D. And let's say every paragraph starts with, experiment A, we did so and so, blah blah blah; next paragraph, experiment B, we did so and so, and so on.

One might ideally like a chunk which literally just contains the first sentence of each paragraph, all in one chunk, because then if somebody asks, hey, what experiments were done, you'd retrieve that chunk.

Whereas another person might like a chunk of each paragraph. In that sense, there has been chunk overlap, because you have information being repeated. But that's quite a complicated, tricky algorithm.

And it's not quite clear how someone
would develop such an algorithm.

So we pushed that aside.

Cause that's like quite complicated.

It's basically saying, let's re-chunk for a different query. So that's a different topic.

And especially when you're trying to create independent chunks, that's a problem you'd ideally rather not have. Ideally, you would just have the chunks be aware of the semantics inside of them and split correctly, so that they can just be left alone.

So yeah that's the whole
thing on chunk overlap.
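(A minimal sketch of the kind of fixed token splitter with overlap being described here; the 400/200 numbers mirror the settings discussed in this episode, and tiktoken's cl100k_base is just one tokenizer choice. Each chunk re-reads the last `overlap` tokens of the previous one to catch whatever got cut mid-sentence.)

```python
import tiktoken

def fixed_token_chunks(text: str, chunk_size: int = 400, overlap: int = 200) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap          # how far each new chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if not window:
            break
        chunks.append(enc.decode(window))
    return chunks
```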

Nicolay Gerold: Yeah, one chunker I also want to double click on is the LLM based one.

Can you actually explain how it
works and how it basically separates

the text into separate chunks?

Brandon Smith: Yes.

So we, so we tried a few.

My, my boss originally posed the
idea and I thought that was insane.

I was like this is silly.

This isn't gonna work.

But then we thought about it.

And okay, maybe it could work.

And it was a fun experiment.

Why not?

The first one we tried was actually just silly. What we did originally, the original way, is that we literally handed over a massive chunk of the corpus. Fundamentally, you know, these models have a context window. You can't hand them that much information, right?

So we'd hand over the first 4,000 tokens and say, hey, GPT-4o, here's the first 4,000 tokens of a corpus. Can you respond to me, in quotes, say chunk one colon, and then write a bunch of text, and then chunk two, and blah, blah, blah.

And that's, I tried that algorithm.

Now, sure enough, as you can imagine, that
had a million trillion problems, right?

The very first problem being that it
would just write its own stuff, right?

It'd say chunk one and just go
off on its own journey, right?

Nothing related to the actual document. Or it would be similar, but it would hallucinate. Before this I worked in a different field, in EdTech, and I was quite used to having to parse outputs from GPT, like JSON parsing. A thing I used to do quite often is validate its response.

So I validated it: I would do an exact text search of whatever it responded with against the original, to make sure that it's actually quoting the original when it writes out a chunk. And it would never quote it.

If there was like, sometimes the grammar
in the original chunk would be a bit

dodgy, and it would fix the grammar,
or it would fix the typo, but then

suddenly that's no longer an exact copy,
and all these things, that was a mess.

And more importantly as well, if
you think about it logically, you're

saying that to chunk a document,
you have to pay the inference cost

of every token in your corpus.

That's really expensive, right?

I think it's 15 pounds per million tokens.

Maybe it's not bad.

I don't know.

It's not good.

It's not good.

Paying a GPT inference cost for every single token in your document isn't really a good idea.

So we dropped that, and instead we made a much nicer system, which was quite smart. What I did is, I basically stuck a bunch of tags inside the documents.

So I, and again easier
seen than explained.

You can see, if you look at the PDF of the paper, in the LLM semantic chunker, what the text looks like after the tags have been added.

So it literally starts with an open angle bracket, chunk zero, close angle bracket, and then there's some text, which might be some sentence, blah, blah, blah. And then an open angle bracket for the end of the first bit and then the start of the next.

And then, because of this, I could then just ask GPT. I can say, hey, read through all these groupings of the text and say which ones should stay together. So maybe the first three groups should stay together, and why? And they have to be consecutive. So in fact, you just tell me where to split.

So it might read everything that's in the first grouping, everything that's in the second grouping, in the third grouping, and say, just split after three. And then we just merge those together.

And so by doing that, GPT can literally just respond with 3, 6, 7, 8, and then it'll also actually give an explanation.

It'll be like, I decided to split
at 3 because blah blah blah blah.

It turns out, I didn't see it because I was just parsing out the JSON, but I had it print the output one time. It turns out it yaps a bunch more.

But anyway so it does that.

And that's how it works.

And honestly, quite
surprisingly, it was pretty good.

It actually performed quite well.
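(A rough sketch of a tag-based LLM chunker in the spirit of what Brandon describes: pre-split the text into small pieces, wrap each with a numbered marker, ask a chat model after which piece indices a new chunk should start, and merge accordingly. The prompt wording, the `call_llm` function, and the JSON answer format are assumptions for illustration, not the paper's exact prompt.)

```python
import json

def llm_chunk(pieces: list[str], call_llm) -> list[str]:
    # Wrap every piece in a numbered tag so the model can refer to it by index.
    tagged = "".join(f"<|piece {i}|>{p}" for i, p in enumerate(pieces))
    prompt = (
        "The text below is divided into numbered pieces. Decide where one topic ends "
        "and another begins. Reply with JSON: {\"split_after\": [piece indices]}.\n\n"
        + tagged
    )
    split_after = sorted(json.loads(call_llm(prompt))["split_after"])

    # Merge consecutive pieces between the chosen split points back into chunks.
    chunks, start = [], 0
    for idx in split_after + [len(pieces) - 1]:
        chunks.append(" ".join(pieces[start:idx + 1]))
        start = idx + 1
    return [c for c in chunks if c]
```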

Nicolay Gerold: Yeah, so in the end,
like the semantic one and the LLM one

is basically we split up the text.

It's separated into like chunks already,
like into really small pieces like atoms.

And then you basically merge it
back together based on, for one, the

similarity, and in the other one,
based on whatever the LLM predicts.

Brandon Smith: Yes.

A hundred percent.

Yeah.

So similar ideas.

They're actually very similar ideas.

Yeah.

In that sense.

Nicolay Gerold: Nice.

What were the other chunking methods
you would highlight from the paper?

Brandon Smith: So, Greg Kamradt. I'm really sorry if I mispronounced his surname, but Greg, really cool guy. I had a call with him.

He is the guy who pretty much started this whole thing. He had that notebook that a lot of people saw, the five levels of semantic chunking. He made a chunker himself, and here's what's cool: his works by having a rolling window. And we implemented that.

Unfortunately, and mind you, Greg made this very clear, he did this whole thing as a fun experiment. He didn't rigorously try to fine tune everything.

And this chunker performed very poorly on our metrics, but the reason why it performed poorly is that it produces really large chunks. It produced chunks of over 1,300 tokens.

And the bigger the chunk size, we've mentioned this before, the worse it will perform in general, right? If they're too big.
If they're too big.

And that's because, when he wrote this algorithm, he didn't realize that it actually matters a lot that you keep your chunks small.

It's at least, in my opinion,
it's a restriction of the

embedding model, actually.

That the chunk should be at
a certain size, otherwise the

retrieval gets a little bit bad.

So yeah, we tried Greg's chunker, and it was alright. But it's a very important chunker.

And we actually made an improvement to the
chunker, where we have a parameter into it

that says, hey, chunks can't be any bigger
than this amount, we said 400 tokens.

And it actually improves
it by quite a bit.

That was another chunker, pretty cool.

And there was also another chunker, one from Pinecone. We tried that and it was pretty good.

I would have liked to include it in the
paper, but there was stuff happening.

And there was some other
details we had to get sorted.

Yeah.

Nicolay Gerold: And what were the chunkers
that didn't make it into the paper?

So the different mechanisms you've
tried and that blatantly failed?

Brandon Smith: Yeah, good question. Let me think. There's definitely a few. There's semantic chunking. It's been a while, since we did this quite a while back. Sorry, let me think. What chunker did we try that failed? Was it the LLM semantic chunker? Yes! Okay, sorry. Damn, I'm completely blanking. ContextCite.

We did, man, we did a lot. In the appendix we have some of the failed chunking algorithms. And there was this paper called ContextCite.

Really cool paper.

And what they try to do, that whole paper was to do with citing, which is, given a RAG pipeline, how can you cite... yeah, you get the gist.

So they basically had a paper where the whole idea of it was to help cite where your LLM response comes from in the prompt.

And I tried to write some nice algorithm that uses that. It was basically trying to use the algorithm they designed, a really nice algorithm in that paper, to find how relevant each chunk was to each other.

That was one thing. It was somewhat of a messy implementation. Most importantly, the whole thing failed massively, because it got really confused by Japanese.

There was some Japanese text,
and it just seemed to throw

off everything really badly.

Because it all blew up.

And honestly, I'm actually quite curious to know if it would work otherwise. I honestly think that might have an effect on the original paper; the ContextCite people should give it a shot. It seems like the language threw off the whole algorithm.

Yeah.

And then we also tried logit and attention based chunking.

And, mind you, like I'm not going to lie,
this was very much experimental, right?

This is one of those things where you
go in just being a little bit curious,

but you don't really have a plan, and
sure enough, nothing works, right?

So I tried using an already good chunker, like a fixed token chunker, or one of our cluster semantic chunkers, chunking with it, and then passing the entire corpus through Llama 3 to see what the attention values look like.

And like the logits, and I'm trying to
see, okay, is there any like signal here?

Okay for example, you can
take the logits, right?

At any given point, you can
see the probability of a

word being generated, right?

By just running it through Llama.

And so you might be like, okay,
is there any signal on that?

Do we see that actually suddenly the
probability gets really low when it

changes from one theme to another.

And the answer is, yeah.

A little bit, but not really.

It's hidden in the noise.

Think about it.

At the start of the next sentence, you could say anything, right? Or maybe it's not that uncertain, and you start with a "the".
But then after the, you
could say so many things.

The tree, the cat, the man, the person, the block, the paper, whatever, right?

And that was all noise, and it just threw the whole thing off. You couldn't find the semantic changes within the noise. It's just how chaotic language actually is.
But it's in our appendix, if anyone's curious, and there are some cool pictures in there.
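(A sketch of the kind of probe being described for the logit-based experiment, which didn't pan out: run the text through a causal LM and look at the log probability the model assigns to each actual next token, hoping for dips at topic boundaries. The model name is just an example; any causal LM from Hugging Face works the same way.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def next_token_logprobs(text: str, model_name: str = "meta-llama/Meta-Llama-3-8B"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                         # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)   # predictions for tokens 1..n-1
    targets = ids[:, 1:]
    # Log probability the model gave to each token that actually came next.
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return tokenizer.convert_ids_to_tokens(targets[0].tolist()), token_logprobs[0]

# In practice the per-token signal is dominated by ordinary linguistic noise
# (sentence starts, function words), which is roughly what the appendix reports.
```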

Nicolay Gerold: Yeah, I've
tried something similar before.

We basically tried to model the
probability distribution of the logits

Brandon Smith: Oh, really?

Nicolay Gerold: It was clunky. I had to make so many different assumptions and stuff like that, and it didn't work so well.

I think the logits aren't really optimized for themes. They're optimized for a single token output, and not for modeling the entire sentence, for example.

Brandon Smith: Yeah, exactly. Yeah, that's it. Because as well, it very quickly jumps back, right? The moment the sentence starts getting onto something, immediately it'll just lock into something, right?

And so, yeah, it was hard to see. It never got confused. This idea that, oh, actually there'll be high entropy tokens where it's giving probability to tons of different tokens, no. It very quickly locks into some theme. But yeah.

Nicolay Gerold: Yeah.

If you had had infinite time and resources at the end of the study, what chunking method would you have loved to have tried?

Brandon Smith: I would have liked to think longer and harder about the very fundamental first thing we said, which is that the perfect chunking algorithm would take a query and find the relevant tokens amongst the document that actually answer that query.

But ultimately, that's actually what matters the most: retrieving only the right stuff for the query.

Yeah, I would have liked to work on that.

As I said, ColBERT has some work on that, and there's something you could do with that. Maybe it's a sliding window technique where you slide along; that's how it works anyway.

Yeah.

Yeah.

Yeah.

But even just thinking about it now, it's putting my head into a spin. There's a lot to think about, because a lot of the time the relevant pieces are disconnected.

Again, if the question is, can you get me the experiment names of every experiment in this paper, you have to retrieve a bunch of disconnected pieces. Versus, can you give me the third sentence of the first paragraph.

That's a very weird question, but
someone could reasonably ask that.

All these different things.

But yeah, I think it'd be nice, and it's very important that one day, and this will happen eventually, we have a chunker that can do that.

It's not really a chunker, it's a whole RAG system that can really retrieve the exact tokens, because that's what fundamentally matters.

That's something I would like
to look into, but it's just hard

to see how you can tackle that.

Nicolay Gerold: Yeah.

And I think this goes more into the entire search literature, or the traditional information retrieval literature. They actually spend way more time thinking about what are the different query types you will encounter and what are the documents I have, which documents fit to which query type, and then you basically try to deconstruct the entire search system and ensure that it matches in the most optimal way.

And I think, yeah, in RAG, we basically threw all of that away.

We embed the documents.

We embed the query, and we hope
something good comes out of it.

Brandon Smith: Yeah, right?

Yeah, no, exactly, you're right.

And it's true, it's a massive, it's
a massive field that people have

worked in for a long time, right?

And, yeah, I think it's true,
we really should spend more

time revisiting old literature.

Ciao.

Nicolay Gerold: Yeah.

You already mentioned that you threw
away one chunking algorithm or method

you tried because of the language.

Are there any additional language
specific considerations for the

chunker people should be aware of?

Brandon Smith: Let me think.

Generally, here's what I'd say.

I'd say, with most chunking algorithms, even the ones we threw out, the issue was when we had a mix, where in the same corpus we had Japanese appearing amongst a bunch of English text. And the issue with that is, and I know these embedding models try not to do this, but it seems like they still treat it a bit separately.

They're like, this is Japanese. And fundamentally, it's true.

A Japanese word is very
different from an English word.

Even if the words are about the same
topic, there's still one axis, language,

where they're very different, right?

Take, I don't know, whatever the Japanese word for a tree is, and the English word for tree. Yeah, they should be somewhat similar. They should be closer together than, I don't know, maybe tree and car.

But at the same time,
they're not the same.

They are two different.

They are words in different languages.

So in the sense of that, in the language
axis, if there is like a dimension

in the embedding model, which is the
language, they're very far apart, right?

So anyway, so yes, you will
find that there's going to be

some disconnect in similarity.

But if you have multiple languages
in the same document, I think if

they're all the same language, none
of these problems should appear.

If it's all the same
language, then it's fine.

That is eliminated.

Yeah, it's true, I haven't actually tested our cluster semantic chunker across different languages.

I think the LLM chunker would do just
fine because it's a large language

model and it's a smart machine.

Nicolay Gerold: Yeah, a thought experiment of mine a while ago was trying to reduce the different words of Latin based languages back to their Latin equivalents and training an embedding model on the Latin equivalents.

This is something I would
love to try at some point.

Brandon Smith: That'd be...

Nicolay Gerold: Because I think the embedding would get more robust across the different Latin languages.

Take any of those: Spanish, English, German all have heavy Latin influences, and if you basically trace back the origin of the words and embed that, this could be an interesting thing to try.

It probably wouldn't be worth the effort, but it would be a fun research project.

Brandon Smith: Yeah. It's one of those philanthropic things you do when you make tons of money.

Nicolay Gerold: Yeah.

When you look ahead, or during your research, did you find any research areas that you think are very promising or very interesting for the future of document chunking or RAG systems?

Brandon Smith: I think we're happy with it. The cluster semantic chunker is good. The LLM chunker is good. They can be a little bit slow though.

I think, fundamentally, to be honest, I actually think this is a pretty closed subject, right? This is chunking, and it's the first part. I think we shed a good bunch of light on that.

I think really, though, what it highlights is that people should think about chunking. And ideally,

I think somebody, I'm hoping someone
out there can start thinking about

the ideal token retrieval, right?

That's really the best thing.

That would be a future line of research.

We haven't actually made a dent in that, because it's just fundamentally very different. Given the query, can you retrieve the exact tokens in the document?

How can you do that?

I don't know.

There's a lot you could do, but
it's just very different from

what everyone's been doing now.

And like as well, the entire vector
database kind of goes out the

window when you do that, right?

I'm not really sure it
depends how you implement it.

I'm not sure, but it'd be very tricky.

That's, I think, a very
promising line of research.

How you could tackle it, I'm not sure.

I think it fundamentally
needs to be done, and it will

eventually be done at some point.

Nicolay Gerold: Yeah.

And if people want to basically get
in touch with you, read the research

papers, hire you, hire Chroma, where
can they get in touch with you?

Brandon Smith: Yeah, that's true.

You can get me on LinkedIn, Brandon Abreu
Smith, or go to my Twitter actually.

What's my Twitter name?

Brandon Starxel that's
Brandon S T A R X E L.

It's a bit random.

That's just my code name on there.

Yeah.

You can get me on either of those.

I love it.

Come like my LinkedIn posts.

Nicolay Gerold: So what can we take away when we are building RAG applications?

Current chunking methods assume text to be continuous, which is a correct assumption in most cases. Text flows naturally, ideas build on top of each other, sentences combine into ideas, and words link together to form meaning.

But most chunking methods
actually even break this flow.

So for example, take the example Brandon gave: John buys a red Ferrari. It is the fastest in town. When you split by sentence, you completely lose the connection, and especially the second sentence loses a lot of context.

And current solutions try to
patch it by overlapping chunks.

So basically, if we split by sentence,
we might include the sentence

prior and the sentence after.

And hope that the context is maintained.

But this brings a lot of its own problems.

First of all, we store the
same text multiple times.

We have to use more compute power.

And also, it makes our
database messy with duplicates.

You might only retrieve chunks because of the included overlap, and not because the chunk itself, without the overlap, would actually contain something meaningful and important for the user's query.

And also, we have context that
could be completely out of the flow.

Of course, it can be in the chunk
above, but it could also be in a

different document, or it could
actually be in the head of the

person who has written the text.


So overlap is really an
imperfect solution for trying to

contextualize the information.

And we actually need better solutions
to do this contextualization

and also to do the chunking.

And the solutions for that could be
something like contextual retrieval.

We could use LLM based chunking
which is in the end, a form

of the contextual retrieval.

Or, as a quick and dirty solution, you can use the overlap. But you should be careful with that: if you implement it, make sure that the context is actually in the flow and is not coming from somewhere else.

And one interesting thing, which is my theory why the chunk size of 200 to 400 tokens works: in most writing guides, and this is followed by most writing, a paragraph is 100 to 250 words and one paragraph handles one idea.

And this really matches well with
the, what the chunking research

actually says, which chunk sizes
work best, like 200 to 400 tokens.

And I think this is actually the sweet
spot because if you treat a paragraph as

a unit of thought, this really lines up
that you take a chunk size which aligns

with that unit and it just makes it
more likely that you have a semantically

coherent and contextually relevant chunk.

So when the chunks are too big, they dilute meaning, and when the chunk size gets too small, you basically break up the context so much that you lose the information.

The two main solutions I think he came up with are, for one, semantic chunking. In semantic chunking, you split the text into smaller segments.

I often like to do sentences,
and then you create embeddings

for each segment and measure how
similar segments are to each other.

And then you group related
segments into coherent chunks.

And you could do that unstructured,
so basically independent of

the order of the segments.

What I like to do is actually
use a rolling window.

So usually I have a minimum chunk size
and a maximum chunk size and I slide over

the document and measure the similarity.

And once the similarity to the
next sentence basically is below a

certain threshold, I basically say,
okay, we have to create a new chunk.

And I basically build on the assumption that the author spent some thought on the order of the document and that it's not completely independent.

And through that, I basically try to find natural topic boundaries in the text. When one topic ends and another begins, I create two different chunks, and I can avoid a little bit of the overlap issue.
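(A minimal sketch of the rolling window approach described here, my own paraphrase rather than a library implementation: walk the document sentence by sentence, compare each sentence's embedding to the running chunk, and start a new chunk when similarity drops below a threshold or the chunk hits its maximum size. `embed` is a placeholder for your embedding model; the threshold and sizes are illustrative.)

```python
import numpy as np

def rolling_window_chunks(sentences, embed, threshold=0.6, min_sents=2, max_sents=12):
    vectors = np.asarray(embed(sentences), dtype=float)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

    chunks, current = [], [0]
    for i in range(1, len(sentences)):
        # Compare the next sentence to the centroid of the chunk built so far.
        centroid = vectors[current].mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        similar_enough = float(vectors[i] @ centroid) >= threshold
        if (similar_enough or len(current) < min_sents) and len(current) < max_sents:
            current.append(i)
        else:
            chunks.append(" ".join(sentences[j] for j in current))
            current = [i]
    chunks.append(" ".join(sentences[j] for j in current))
    return chunks
```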

The second solution he found
was the LLM based chunking.

So you basically teach the machine
to think a little bit like an

editor and you use a language model
to read the text and decide where

it naturally breaks into parts.

So the LLM marks potential
breakpoints with a special token.

Then you ask the LLM to
evaluate these points.

And then you have basically a
decision which sections should stay

together and which should be split.

And you use these decisions
to create the final chunks.

And this really banks on the idea that LLMs actually can understand context and meaning.

And I think this method really has evolved into the contextual retrieval by Anthropic.

Where, on top of basically the chunking, you add back in the contextual information.

So you basically have the entire document. You look at the chunk, feed it into the LLM, and ask it, hey, what's missing here from the context of the document that I should insert?

And that's it.

When I use it, I don't just feed
in the prior document, but also

important definitions for the
field, which might diverge from

like the knowledge base of the LLM.

So usually you have different
definitions of something, depending

who you're asking or who you're
talking to, or what field we are in.

And you usually also include
these definitions to basically

make it contextually relevant.

We had a legal use case recently where we also included a lot of legal information, basically different paragraphs and stuff like that, to contextualize it even more.

And through that, you basically try to create semantically distinct chunks, but still try to get chunks that are completely self contained and contain all the information you might need to answer questions about that chunk.
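(A hedged sketch of contextual retrieval in the spirit described here: for each chunk, ask an LLM to write a short piece of situating context from the full document, plus any domain definitions, and prepend it to the chunk before embedding. The prompt wording and the `call_llm` function are assumptions for illustration, not Anthropic's exact prompt.)

```python
def contextualize_chunk(chunk: str, document: str, call_llm, definitions: str = "") -> str:
    prompt = (
        "Here is a document:\n"
        f"{document}\n\n"
        f"Relevant domain definitions (may be empty):\n{definitions}\n\n"
        "Here is one chunk taken from that document:\n"
        f"{chunk}\n\n"
        "Write a short context (1-3 sentences) that situates this chunk within the "
        "document, resolving pronouns and references, so the chunk can be understood "
        "on its own. Answer with the context only."
    )
    context = call_llm(prompt)
    # The contextualized text is what gets embedded and indexed.
    return f"{context}\n\n{chunk}"
```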

And I think this is a
really interesting approach.

I think we are evolving into a future
where the size of the information

isn't the driving factor, but for
that we also need better embedding

models that can handle longer chunks.

And I think we have seen one come out recently, which is ModernBERT, which performs better on longer context windows.

And if we get there, we actually can be more flexible in the chunk length, so we can actually create chunks of different sizes, or with different size context windows, and focus more on making them semantically self contained and handling only one topic.

So to recap a little bit: best practices for now are keeping the chunk size in the sweet spot of 200 to 400 tokens. This is something we have already heard in previous podcast episodes; I will link to them in the show notes.

And also figure out a way to
keep the chunks semantically

coherent and self contained.

If you want to hear more about
that, I have had a recent

episode with Max Buckley.

We really went deep into
that on how we can do it.

So yeah, that's it for the episode.

If you liked it, let me know below. Leave a review.

It helps us out a lot.

Otherwise we will continue next week.

I wish you a happy new year.

Otherwise, I will catch you soon.

See ya.
