Nicolay Gerold: The biggest lie in
RAG is that semantic search is simple.
The reality is that it's easy to
build, it's easy to get up and running,
but it's really hard to get right.
And if you don't have a good setup,
it's near impossible to debug.
One of the reasons it's really
hard is actually chunking.
And there are a lot of
things you can get wrong.
And even OpenAI bungled it a little bit, in my opinion, using an 800 token length for the chunks.
And this might work for legal, where
you have a lot of boilerplate that
carries little semantic meaning,
but often you have the opposite.
You have very information dense content. Imagine fitting an entire Wikipedia page into the size of a tweet.
There will be a lot of information that's actually lost, and that's what happens with long chunks.
The next is overlap. OpenAI uses a 400 token overlap, or used to.
And what this does is try to bring the important context into the chunk, but in reality, we don't really know where the context is coming from.
It could be from a few pages prior,
not just the 400 tokens before.
It could also be from a definition
that's not even in the document at all.
There is a really interesting solution from Anthropic, Contextual Retrieval, where you basically preprocess all the chunks to see whether there is any missing information, and you try to reintroduce it.
So today we are back continuing
our series on search.
We are talking to Brandon Smith,
who is a research engineer at
Chroma, the vector database.
And he actually led one of the largest studies in the field on different chunking techniques.
So today we will look at how we can unfuck our RAG systems from badly chosen chunking hyperparameters.
Let's do it!
So what do you think is the motivation for chunking?
Why do we actually need it?
Brandon Smith: That's a good question.
To be honest with you, I do genuinely believe we only do chunking because we have embedding models.
And in a better world, we wouldn't
actually chunk at all, right?
In a better world, we would extract
the tokens that matter for a given
query and bring them back perfectly.
But that's not what happened.
It turns out we can make some very powerful embedding models that are very strong.
But they obviously need an input string.
You have to give them an input string, and they need context as well, actually.
Fundamentally, you have a query, and you have some passages in the text that you want to retrieve for that given query.
And of course, the simple way we know to do this is that you embed the query, you embed a bunch of chunks of the documents, and then you retrieve the matching chunk for the given query.
Ideally, a better system in theory would actually take the query and really read along the entire corpus.
If you gave it to a person, they'd take the query and read along the entire corpus with that query in hand until they see something that matches, and they'd go, great, this is good.
These two match.
But what we've actually said is that when we embed our corpus, we just do it once.
It doesn't matter how the query changes; at least in general, in standard RAG, you just do it once.
And we do that because
it's really efficient.
We embed the entire corpus
and then every query, they get
their own separate embedding.
And we just do a little outer
product between them whenever.
And really, fundamentally,
we chunk because it's just
efficient and it's quick, right?
There is a better way.
And as I said, this is, I believe, how ColBERT works.
My boss, Anton, also did some papers on that.
It's not like you just chunk and embed separately; the process is done together.
You hand in the query and a chunk, and the two then compute the similarity bound together.
You can't just have their vectors held fixed in place.
But yeah, to summarize the
question, why do we chunk?
I'd say it's because
it's efficient and quick.
That's why we chunk right now.
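As a rough illustration of the setup Brandon is describing, here is a minimal sketch of standard RAG retrieval: the corpus chunks are embedded once up front, and each query only costs one extra embedding plus a similarity lookup. It assumes the OpenAI Python SDK and the text-embedding-3-large model mentioned later in the episode; the example chunks and the retrieve helper are illustrative.

```python
# Minimal sketch: embed the corpus once offline, embed each query at retrieval time.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a list of strings and return a (len(texts), dim) array."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])

# Done once, offline: embed every chunk of the corpus and normalize for cosine similarity.
chunks = ["John buys a red Ferrari.", "It is the fastest in town.", "Mary rides a blue bicycle."]
chunk_vecs = embed(chunks)
chunk_vecs /= np.linalg.norm(chunk_vecs, axis=1, keepdims=True)

# Done per query: one embedding call plus one matrix-vector product over the whole corpus.
def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed([query])[0]
    q /= np.linalg.norm(q)
    scores = chunk_vecs @ q
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]

print(retrieve("How fast is John's car?"))
```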
Nicolay Gerold: Yeah, it basically avoids re-embedding the entire corpus we have, which can be millions of documents, at retrieval time.
Brandon Smith: absolutely.
Exactly.
And yeah, and it adds latency, right?
We want this to be quick.
If you're using a chatbot in particular, you don't want the user waiting while it re-embeds everything to get something back.
So yeah.
Nicolay Gerold: Yeah, and we basically trade off accuracy for the speed of retrieval, but we can add it back in by using rerankers, basically doing the stuff you said after you retrieve a candidate set.
And what are the main issues with current methods?
I think we have seen so many different chunkers, especially in LangChain, like the Markdown one, HTML splitters, but I think most people use the standard one just based on length.
Brandon Smith: Yeah.
A hundred percent.
Exactly.
I think the issue with chunking before our paper was that no one knew.
No one knew what was better.
There's actually a lot to be said.
There's a whole thing that I'm
going to make some posts about,
which I think is really interesting.
No one knew what was better.
Maybe it's a bit of a bold claim to make, but I believe OpenAI themselves also didn't know what was better, because there's a chunking strategy they use that wasn't very good.
And we'll get into that.
But yeah, no one knew which chunking method is better, because there was no way to compare them.
There's no metric.
There's not really a metric.
Of course you could do the standard RAG metrics and test the entire pipeline, but then you're not really isolating the chunker, right?
You'd ideally like to isolate the chunker and say, okay, let's just compare the chunkers.
And so that was the issue: as you say, there was no way to see which is better.
Fundamentally one could also ask, okay, what would we imagine a good chunking strategy to be?
What is it at the end of the day?
Really, all that matters is that for a given query, we retrieve exactly the relevant information.
That's actually it: for any given query, you want to retrieve only the tokens that relate to that exact query.
That would be the ideal RAG pipeline.
In practice, we don't have that.
Extra stuff comes along.
So you say, okay, fine, extra stuff could come along, but we want as little extra stuff as possible to come along.
And a chunker that means tons of extra stuff comes along is a very bad chunker, right?
So you can say the most trivial chunking algorithm would be one that doesn't chunk at all and literally hands the entire corpus over.
That's in theory a possible solution.
It's horrible, though, because you get tons and tons of useless stuff in the chunks.
Of course, then we have what became really popular, the fixed token splitter and the recursive token splitter.
And they seem very sensible at first.
I think the fixed token splitter
is the most natural next step.
It's okay, let's just split it into fixed lengths, and then we embed all of these.
And yeah, it's okay.
But that could break right through a word.
It's not ideal.
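For reference, a fixed token splitter of the kind just described can be sketched in a few lines. This is a hedged example, not the exact implementation from the paper; it uses tiktoken, and the chunk size and encoding name are just illustrative defaults.

```python
# A minimal fixed token splitter: encode the text, cut the token stream into
# fixed-size windows, and decode each window back to text.
import tiktoken

def fixed_token_split(text: str, chunk_size: int = 400) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    # Note: this can cut straight through a word or sentence,
    # which is exactly the weakness discussed here.
    return [
        enc.decode(tokens[i : i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]
```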
Okay, I'll hop into this, right?
Basically, now we're in on the idea that we want to break the text into pieces, right?
We have now decided that we're going to make this system.
We will actually embed everything one time, right?
Because we mentioned at the start that the ideal system wouldn't actually do that.
It would do a bunch of other stuff, and it would be specific to the query.
But to keep the latency low, we're just going to embed everything once.
And to be honest, our paper mentions that the focus really is: given that we want to just embed everything once, what is the best chunker?
Because people have mentioned that if you re-embed the stuff, you can maybe do something better.
That's not really the focus.
Given you only embed everything once, what is the best chunker you could do there?
Fundamentally obviously, we
wanted to retrieve only the
specific tokens for a given query.
But, it's a practical world.
It's a real world.
The embedding model has to have enough
context inside that chunk to be able
to bring it back to the query, right?
For example, let's say there's a passage and it's about some guy, John, and it says: John buys a red Ferrari.
Full stop.
It is the fastest in town.
Full stop.
And somebody queries: how fast is John's car?
Let's say our chunking method literally broke off "it is the fastest in town" and made that its own individual chunk.
That chunk would actually never get retrieved, because it's got nothing in it about a car.
It just says, it's the fastest in town.
You've actually completely lost all information by just breaking that into its own chunk.
You just completely made it useless.
You've just deleted it.
So even though that's exactly the right answer, and a person reading along could just pull that sentence out, saying it's the fastest in town, the embedding model has no hope of ever retrieving it, because we've broken the information, we've removed the flow of things.
So that's what happens when it comes to breaking things down like that.
Fundamentally, I guess my point
with that is that you have to keep
the context within the chunks.
The chunks have to contain context.
That's what matters.
And yes, with all these different chunking methods, ideally you have something that has context within it.
I'm going on a bit of a tangent
here and I feel like I'm covering
a bunch of other questions.
So feel free to interrupt me.
Maybe we should reorient ourselves before I go off down that whole pipeline.
Nicolay Gerold: No, I like it.
One thought I would love to get your thoughts on was my idea of MECE chunks, mutually exclusive but collectively exhaustive, which is something I try to get to.
But in the end, what I landed on is one of the chunkers which you also basically implemented in the paper, the more semantically oriented chunker: you try to make the chunks as semantically coherent as possible.
Can you maybe start with these?
What can it do?
How does it work?
Brandon Smith: Yes.
A hundred percent.
Yeah.
I thought a lot about that.
Yes.
Because it really comes back to that initial thought experiment, the one with the guy who had the fast car, it's the fastest in town, whatever, right?
You could look at the chunk: it was the fastest in town.
If that was just the entire chunk.
And immediately you can say
this is missing context.
There's clearly missing context.
There's a pronoun here.
I don't even know what
word it's about, right?
You could almost look at the chunk itself
and see if it's a bad chunk by being
like this is just incoherent, right?
So you can criticize the chunks themselves.
Ideally, what you want is a chunker that produces chunks where every chunk is super coherent and completely self contained.
And that's what we did with the cluster semantic chunker.
And mind you, I won't lie, that's one of those chunkers where, I don't know if you saw the paper, there's a picture that I think makes that chunker so clear and so obvious.
And that's actually how I derived the thing in the first place.
What I did was I started off with a recursive token chunker.
And I just split the document into 50 token pieces, which is very short, right?
50 tokens is what, slightly under 50 words?
It's like a sentence, maybe two sentences.
And I did that because I wanted to split by a sentence, but you never know if there are no sentences, blah, blah, blah, whatever weird HTML document I have.
Anyway, so let's say you have a corpus.
You run the recursive chunker on this, and you then have 100 chunks, right?
What I then said is okay,
cool, we now have 100 chunks.
Let's embed all of these chunks, so
that we have all their embeddings,
and let's find the similarity
between all pairs of chunks.
And again, it's one of those things where if I say it out loud, it's hard to follow along.
There's an image in the paper you can look at, for the cluster semantic chunker; you may have seen it.
And it's a big matrix, with this staircase pattern down the middle.
And it's a symmetric matrix, so you
see these blocks down the middle.
And all these blocks, they're sat on the
diagonal, they're all on the diagonal.
And what that matrix is, is just the similarity between any pair, right?
Down the diagonal, that's the similarity between a thing and itself.
So the very top left corner, that's zero, zero.
That's saying, what's the similarity between chunk zero and chunk zero?
And then the next one, at one, one, is the similarity between chunk one and chunk one.
Obviously, that's all one.
So you have ones on the diagonal.
Great.
Then in the off-diagonals around it, you're basically asking a question: okay, what's the similarity between this chunk and its neighbor, or its neighbor's neighbor?
And sure enough, I literally just ran that code, not sure what I'd see, but when we generated that matrix, you see these blocks down the diagonal.
We did this on the wiki articles corpus, and sure enough.
You could see actual articles; each appears as a massive block, and that's a single article.
Fundamentally, it turns out that if you break up an article and you get the similarity between all the pieces' embeddings, they're all very similar to each other, and they're all very different from other articles.
So that's basically the logic that starts the cluster semantic chunker.
It's wow, I can literally see it.
I can literally visualize the pairwise
similarity between all chunks.
And I can see things that are different contexts and different themes.
So why don't we try our best to make a chunking algorithm that just tries its best to group everything together that's about the same thing, right?
Try to capture these big squares and never break in between two squares, two different topics.
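To make the picture Brandon describes concrete, here is a small sketch of that first step: embed the small pieces and compute the pairwise cosine similarity matrix whose diagonal blocks correspond to topics. The embed function is a stand-in for whatever embedding model you use; the plotting hint is just for inspecting the blocks.

```python
# Sketch of the first step of the cluster semantic chunker as described:
# split the corpus into small ~50-token pieces, embed them, and compute the
# pairwise cosine similarity matrix whose diagonal blocks reveal topic boundaries.
import numpy as np

def similarity_matrix(pieces: list[str], embed) -> np.ndarray:
    vecs = embed(pieces)                                    # (n, dim)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs @ vecs.T                                    # (n, n) cosine similarities

# Visual check: plot the matrix and look for square blocks along the diagonal,
# each block corresponding to one coherent topic or article.
# import matplotlib.pyplot as plt
# plt.imshow(similarity_matrix(pieces, embed)); plt.show()
```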
Nicolay Gerold: Yeah, there you have the added assumption you can actually make in most cases when it's human written: that there is some inherent structure in the document.
And with the semantic chunker, if you do it still basically in the flow of the document, you expect that there are blocks of topics.
Which...
Brandon Smith: Yes, exactly.
Exactly.
Yeah.
And yeah, exactly.
And I guess it's true.
It's very much made for chunking long corpuses that someone's written.
And it did a good job.
It did a pretty good job.
Yes.
I was really happy with that.
And as well, I felt pretty smart: I used a dynamic programming algorithm to find the optimal chunking.
That's pretty neat.
A bit of a LeetCode hard challenge.
Yeah, it was good.
Nicolay Gerold: What were the different parameters you were throwing into the optimization algorithm?
So what were the degrees of freedom in the end?
Brandon Smith: Yeah, you're completely right.
Me and Anton originally agreed that we wanted as few hyperparameters as possible, right?
Because no one tunes hyperparameters, no one wants to touch them, right?
So the fewer there are, the better the algorithm will work, because no one's going to tune it, right?
And so I really tried, and this is a bit stupid looking back, but I really tried to make something that had no hyperparameters.
But if you think about it logically, there will always be, there has to be, at least one parameter.
Because with this idea of trying to keep semantically similar chunks together, the degenerate optimum is that every single sentence becomes its own chunk, right?
Because obviously the most semantically similar thing to a chunk is itself.
So you do actually have to balance it a little bit.
The way I did it, and I think someone could improve this algorithm, it's just how I did it and it seemed to work quite well, is this.
First, obviously, I normalized the diagonal values.
The diagonal was all ones, because obviously the similarity between a chunk and itself is one.
I normalized that to instead just be the average similarity between all pairs, right?
So that the diagonal suddenly isn't that strong.
And I made an algorithm that said, okay, basically: you can only chunk consecutive things.
And this is another thing, I said it can only chunk consecutive things, right?
You couldn't put one sentence at the start and a sentence at the very end of the document in the same chunk.
That would make no sense.
They're at completely opposite ends.
They're not consecutive, right?
So I said it can only group consecutive things.
That means that effectively, you're basically placing squares along the diagonal that always connect.
Again, when you see the image of the matrix, it's quite easy to visualize this.
You just have square groups along the diagonal.
And you're basically asking, how big can I make these squares, or how small can I make them, along this diagonal.
So that fixes another thing.
And then you have: how big can a chunk be?
I said a chunk should be no bigger than 400 tokens.
Sorry, yeah, 400 tokens.
And then I normalized the scores by subtracting the average score from everything as well; that just normalizes it, because the reward function for a given chunk is to have the highest sum of similarity scores.
Basically, that group appears as a square on the matrix, and the reward takes the sum of every matrix element inside of it.
Of course, you'd say then the optimal strategy would be to make the square as big as possible.
It would be, if there weren't any negative matrix values.
But there are negative matrix values, because we subtract the average from everything.
So if the similarity between two pieces is slightly less than the average, it's actually negative, and it's worse to include it inside the chunk.
So that's the reward function.
Again, somewhat technical to say out loud.
I guess any algorithm explained
verbally is quite confusing.
You generally need some diagrams, right?
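Since the reward function is easier to read than to say out loud, here is a condensed sketch of the grouping step as described: subtract the mean similarity so unrelated pairs score negative, then use dynamic programming over consecutive groups to maximize the summed similarity inside each square on the diagonal. This is an assumption-laden reconstruction, not the paper's code; in particular, max_pieces caps the number of small pieces per group, whereas the real implementation caps chunks at roughly 400 tokens.

```python
# Dynamic programming over consecutive groups of small pieces, scored by the
# sum of mean-subtracted pairwise similarities inside each group.
import numpy as np

def best_grouping(sim: np.ndarray, max_pieces: int) -> list[tuple[int, int]]:
    n = sim.shape[0]
    s = sim - sim.mean()                        # reward is relative to the average pair
    # 2D prefix sums so a block's score can be read off in O(1)
    p = np.zeros((n + 1, n + 1))
    p[1:, 1:] = s.cumsum(0).cumsum(1)
    block = lambda i, j: p[j, j] - p[i, j] - p[j, i] + p[i, i]   # sum of s[i:j, i:j]

    best = [-np.inf] * (n + 1)                  # best[i]: best score for pieces [0, i)
    best[0] = 0.0
    cut = [0] * (n + 1)                         # remembers where the last group started
    for i in range(1, n + 1):
        for j in range(max(0, i - max_pieces), i):
            score = best[j] + block(j, i)
            if score > best[i]:
                best[i], cut[i] = score, j
    # walk the cuts back into (start, end) index ranges over the small pieces
    groups, i = [], n
    while i > 0:
        groups.append((cut[i], i))
        i = cut[i]
    return groups[::-1]
```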
Nicolay Gerold: Yeah.
Why did you choose 400 words or 400 tokens as a cutoff?
Why that value?
Brandon Smith: Yes, you're right.
That does sound very arbitrary.
It's not, because we messed around a lot, so I got a good understanding, an intuitive understanding.
Yeah, okay.
Perhaps this is a good time to segue into OpenAI's mistake, what I believe is OpenAI's mistake, right?
From my experience, we played around with a lot of different token and recursive splitters.
And we found that, in general, around 200 to 400 token size chunks do really well for recall, with both the fixed token splitter and the recursive token splitter.
And if you think about it logically, and I'm saying this because I study physics, we always think about extremes.
Whenever you think about a problem, you're always thinking in extremes, right?
Let's take a fixed window splitter, right?
One extreme is that the window is one character.
It's as tiny a window as it could be, a single character.
You would agree that its performance would be horrible, right?
If it's just one character, you split the document into single characters.
There's no context there; it completely breaks the document.
The other extreme is that it's the entire corpus.
You tune that parameter from zero up to a million or a bajillion, right?
And that means it just grabs the entire document and throws it in.
Again, really horrible retrieval.
I guess recall is technically good, but performance overall is horrible, because efficiency, blah, blah, blah, is really bad, right?
So clearly, there is some sweet spot between zero and a million.
There is going to be some maximum.
There is going to be some ideal width of a chunk.
If you just think about it logically, if it's going to be bad on either side, there has to be a maximum in between.
And sure enough, we found that around 200 to 400, things do really well.
And there is a reason why 200 to 400 token chunk sizes are good.
And mind you, this is for OpenAI's text-embedding-3-large.
Okay.
This is all done with OpenAI's text-embedding-3-large.
If you use a different embedding model, I don't know.
Some people use sentence transformers.
Personally, I'm not a fan.
I think OpenAI's embedding models are great.
But yeah, for OpenAI's text-embedding-3-large, which is also what OpenAI uses for their Assistants, we find 200 to 400 token windows are the best.
And my intuition for that is that obviously, if you make the window too small, you lose context.
It won't get retrieved.
Your context will be lost, right?
If you make the window too big, it dilutes the information.
Suddenly you're covering too many topics, and maybe you're just trying to get a very niche topic that's within this massive chunk; maybe it's some tax policy, and there's one tiny remark in the tax policy that says, by the way, this has to be done by this date.
But there's a million other things in
the document, and you query, hey, when
does this tax policy have to be done?
Because it's such a small part of this
huge document, it doesn't get retrieved.
And in my opinion, that's like
the sort of idea of dilution.
So anyway, we find that 200 to
400 is around that sweet spot.
And that takes us to OpenAI, I think the parameters they stated recently for their new Assistants.
And man, my boss made a good point.
They were so lost in their own hubris.
Because of course OpenAI makes amazing embedding models, and the context windows are getting bigger and bigger.
Their embedding models have really big context windows.
You can, if you want, stick 4,000 token documents into their context window.
But it doesn't mean you should, and I think OpenAI got a bit carried away by it.
I believe, and it's in the paper, that the OpenAI Assistants use 800 token windows with 400 token overlap, right?
It doesn't do so well, it doesn't do as well.
Before, they had 400 token windows and 200 overlap, and that's way better, and in my opinion, it makes sense.
If you make your windows 800 wide, that's so much information.
Obviously your chunks will be somewhat diluted.
You could have a lot of different stuff happening, and if you're trying to get a niche thing back, maybe it works if you have really bland chunks that yap on and on about a single topic.
But otherwise, if there's a lot of stuff happening and you're sticking that all into this massive 800 token chunk, there's a good chance it's just going to be missed.
The retrieval for that single tiny bit is just going to be lost.
And it seems, from our tests, that it does get missed.
So that was a blunder on OpenAI's part.
Before, they used to have 400 token windows, and that was better; our statistics show it's better.
They should change that.
Nicolay Gerold: Yeah.
Maybe an intuition on the chunk size, why it could work.
I think there are general writing guidelines for how long a paragraph should be.
And it's mostly around 100 to 250 words.
And this basically comes down exactly to this token range.
If you assume like 2.5 tokens per word, this comes down exactly to it.
And if you treat a paragraph as a unit of thought, which is pretty coherent, you actually can come down to that.
Brandon Smith: Yeah, absolutely.
Yeah, actually, that makes tons of sense.
You can see it lines up so well, and yeah, I think I blundered on that.
Nicolay Gerold: Why the large overlap?
So can you maybe explain what the overlap is for in general?
Like, why do I actually integrate the overlap into the different chunks?
Brandon Smith: Yes, yeah, sure, of course.
Yeah, overlap.
Now, of course, me and Anton asked in this project: okay, what is overlap?
Why do we have overlap, right?
The reason why we have overlap is that these chunking algorithms can't see what's inside the chunks.
Overlap only exists for the recursive token and the fixed token splitters.
And the reason why, let's take the example of the fixed token splitter, it's quite easy to explain: let's say you have a chunk size of, I don't know, five words, right?
And you have a sentence; you can imagine that it's going to break that sentence.
The chunk might end midway through a sentence.
And the reason why we do chunk overlap is that the next chunk starts a little bit earlier, and so that whole sentence, the one that got broken in the previous chunk, is covered by the next chunk; it picks it up because it starts a bit before.
So it's the idea that because one chunk might end early, the next chunk starts early to catch whatever was broken.
That's why chunk overlap exists.
So fundamentally, it exists because the chunker has no idea what's inside a chunk.
It can't tell.
The chunker can't tell if it broke a word in half or if it broke a sentence in half.
So we use chunk overlap, as far as I'm concerned, as a way to deal with the fact that oops, we may have just broken something in half.
Let's start early in the next chunk.
So it's actually, when you think about it, quite a hacky solution.
And so me and Anton, in our project, we wanted this: the ideal chunker should have no chunk overlap at all.
To an extent.
There really shouldn't be
any chunk overlap, right?
I say to an extent, because sometimes you could want all sorts of things.
There was this one case: let's say someone has a paper and they had multiple experiments, experiment A, experiment B, experiment C, experiment D, and every paragraph starts with one, experiment A, we did so and so, blah blah blah, next paragraph, experiment B, we did so and so, and so on.
One person might ideally like a chunk which literally just lists the experiments, just the first sentence of each paragraph, all in one chunk, because then somebody asks, hey, what experiments were done?
You'd retrieve that chunk.
Whereas another person might like a chunk of each paragraph.
In that sense, there's been chunk overlap, because you have information being repeated.
But that's quite a complicated, tricky algorithm.
And it's not quite clear how someone would develop such an algorithm.
So we pushed that aside.
Because that's quite complicated.
It's basically saying, let's re-chunk for a different query.
So that's a different topic.
And especially when you're trying to create independent chunks, that's a problem you'd ideally rather not have.
Ideally, you would just have the chunks be aware of the semantics inside of them and split correctly, so that they can just be left apart.
So yeah, that's the whole thing on chunk overlap.
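As a concrete reference for the mechanism described above, here is a minimal sketch of a fixed token splitter with overlap, where each window starts before the previous one ended so a broken sentence gets picked up whole by the next chunk. It uses tiktoken; the 800/400 defaults only mirror the OpenAI Assistants setup discussed earlier and are not a recommendation.

```python
# Fixed token splitter with overlap: each new window advances by
# (chunk_size - overlap) tokens, so consecutive windows share `overlap` tokens.
import tiktoken

def split_with_overlap(text: str, chunk_size: int = 800, overlap: int = 400) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap               # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start : start + chunk_size]
        chunks.append(enc.decode(window))     # overlapping text is stored twice
        if start + chunk_size >= len(tokens):
            break
    return chunks
```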
Nicolay Gerold: Yeah, one chunker I also want to double click on is the LLM based one.
Can you actually explain how it works and how it basically separates the text into chunks?
Brandon Smith: Yes.
So we, so we tried a few.
My, my boss originally posed the
idea and I thought that was insane.
I was like this is silly.
This isn't gonna work.
But then we thought about it.
And okay, maybe it could work.
And it was a fun experiment.
Why not?
The first one we tried was actually just silly.
What we did originally was that we literally handed over a massive chunk of the corpus.
Fundamentally, you know, these models have a context window.
You can't hand them that much information, right?
So we'd hand over the first 4,000 tokens and say, hey, GPT-4o, here are the first 4,000 tokens of a corpus.
Can you respond to me, in quotes, with chunk one, colon, and then write a bunch of text, and then chunk two, and blah, blah, blah.
And that's, I tried that algorithm.
Now, sure enough, as you can imagine, that
had a million trillion problems, right?
The very first problem being that it would just write its own stuff, right?
It'd say chunk one and just go off on its own journey, right?
Nothing related to the actual document, or it would be similar, it would hallucinate.
Before this I worked in EdTech, so I was quite used to having to parse outputs from GPT, like JSON parsing, and a thing I used to do quite often is validate its response.
So I validated it: I would do an exact text search of whatever it responded with against the original, to make sure that it's actually quoting the original when it writes out a chunk.
And it would never quote it exactly.
Sometimes the grammar in the original chunk would be a bit dodgy, and it would fix the grammar, or it would fix a typo, but then suddenly that's no longer an exact copy, and all these things, it was a mess.
And more importantly as well, if you think about it logically, you're saying that to chunk a document, you have to pay the inference cost of every token in your corpus.
That's really expensive, right?
I think it's 15 pounds per million tokens.
Maybe it's not bad.
I don't know.
It's not good.
Paying an inference cost for every single token in your document with GPT isn't really a good idea.
So we dropped that, and instead we made a much nicer system, which was quite smart.
What I did is, I basically stuck a bunch of tags inside the documents.
And again, this is easier seen than explained.
If you look at the PDF of the paper, you can see in the LLM semantic chunker what the text looks like after the tags have been added.
It literally starts with open angle bracket, chunk zero, close angle bracket, and then there's some text, which might be some sentence, blah, blah, blah.
And then an open angle bracket marks the end of the first bit and the start of the next.
And then because of this, I could just ask GPT, hey, read through all these groupings of the text and say which should stay together.
So maybe the first three groups should stay together, and why?
And they have to be consecutive.
So in fact, you just tell me when to split.
So it might read everything that's in the first grouping, everything in the second grouping, in the third grouping, and say, just split after three.
And then we just put those together.
And so by doing that, GPT can literally just respond with 3, 6, 7, 8, and then it'll also actually give an explanation.
It'll be like, I decided to split at 3 because blah blah blah.
It turns out, and I didn't see it because I was just parsing the JSON, but I had it print the output one time, it yaps a bunch more.
But anyway so it does that.
And that's how it works.
And honestly, quite
surprisingly, it was pretty good.
It actually performed quite well.
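Here is a rough sketch of that tagging approach: wrap each small piece in an index tag, show the tagged text to the model, and ask only for the indices after which a new chunk should start, then merge the pieces accordingly. The prompt wording, the JSON shape, and the model name are illustrative assumptions, not the exact ones used in the paper.

```python
# LLM-assisted chunking: the model only returns split indices, not rewritten text,
# so nothing can be hallucinated into the chunks themselves.
import json
from openai import OpenAI

client = OpenAI()

def llm_split_points(pieces: list[str]) -> list[int]:
    tagged = "".join(f"<chunk {i}>{p}" for i, p in enumerate(pieces))
    prompt = (
        "The text below is divided into numbered pieces with <chunk i> tags. "
        "Group consecutive pieces that belong to the same topic. "
        'Reply with JSON like {"split_after": [3, 7, 12]}, listing the piece '
        "indices after which a new chunk should begin.\n\n" + tagged
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["split_after"]

def merge(pieces: list[str], split_after: list[int]) -> list[str]:
    cuts = set(split_after)
    chunks, current = [], []
    for i, piece in enumerate(pieces):
        current.append(piece)
        if i in cuts:                 # a new chunk starts after this piece
            chunks.append("".join(current))
            current = []
    if current:
        chunks.append("".join(current))
    return chunks
```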
Nicolay Gerold: Yeah, so in the end, the semantic one and the LLM one are basically: we split up the text into really small pieces, like atoms, and then you basically merge it back together based on, in one case, the similarity, and in the other case, whatever the LLM predicts.
Brandon Smith: Yes.
A hundred percent.
Yeah.
So similar ideas.
They're actually very similar ideas.
Yeah.
In that sense.
Nicolay Gerold: Nice.
What were the other chunking methods
you would highlight from the paper?
Brandon Smith: So, Greg Kamradt.
I'm really sorry if I mispronounce his surname, but Greg, really cool guy.
I had a call with him.
He is the guy who pretty much started this whole thing.
He had that notebook that a lot of people saw, the five levels of text splitting.
He made a chunker himself, and here's what's cool: his works by having a rolling window.
And we implemented that.
Unfortunately, and mind you, Greg made this very clear, he did this whole thing as a fun experiment.
He didn't rigorously try to fine tune everything.
And this chunker performed very poorly on our metrics, but the reason why it performed poorly is that it produces really large chunks; it produced chunks of over 1,300 tokens.
And the bigger the chunk size, we've mentioned this before, the worse it will perform in general, right?
If they're too big.
And that's because, when he wrote this algorithm, he didn't realize that it actually matters a lot that you keep your chunks small.
In my opinion, it's a restriction of the embedding model, actually, that the chunks should be a certain size, otherwise the retrieval gets a little bit bad.
So yeah, we tried Greg's chunker, and it was alright.
But it's a very important chunker.
And we actually made an improvement to it, where we added a parameter that says, hey, chunks can't be any bigger than this amount; we said 400 tokens.
And that actually improves it by quite a bit.
That was another chunker, pretty cool.
And there was also another chunker, a Pinecone chunker.
We tried that and they were pretty good.
I would have liked to include it in the
paper, but there was stuff happening.
And there was some other
details we had to get sorted.
Yeah.
Nicolay Gerold: And what were the chunkers
that didn't make it into the paper?
So the different mechanisms you've
tried and that blatantly failed?
Brandon Smith: Yeah, good question.
Let me think.
There's definitely a few.
There's semantic chunking.
It's been a while; we did this chunking work quite a while back.
Sorry, let me think.
What chunkers did we try that failed?
Was it the LLM semantic chunker?
Yes!
Okay, sorry.
Damn, I'm completely blanking.
ContextCite.
We did, man, we did a lot.
In the appendix we have some of the failed chunking algorithms.
And there was this paper called ContextCite.
Really cool paper.
That whole paper was to do with citing: given a RAG pipeline, how can you cite where your LLM response comes from in the prompt.
And I tried to write some nice algorithm that uses that.
It was basically trying to use the algorithm they designed in that paper, a really nice algorithm, to find how relevant each chunk was to each other.
That was one thing; it was somewhat of a messy implementation.
Most importantly, the whole thing failed massively, because it got really confused by Japanese.
There was some Japanese text, and it just seemed to throw everything off really badly.
It all blew up.
And honestly, I'm actually quite curious; I honestly think that might have an effect on the original paper.
The ContextCite people should give it a shot.
It seems like the language threw off the whole algorithm.
Yeah.
And then we also tried logit and attention based chunking.
And, mind you, I'm not going to lie, this was very much experimental, right?
This is one of those things where you go in just being a little bit curious, but you don't really have a plan, and sure enough, nothing works, right?
So I tried using an already good chunker, like a fixed token chunker, or one of our cluster semantic chunkers, and then passing the entire corpus through Llama 3 to see what the attention values look like.
I had the attention values and the logits, and I was trying to see, okay, is there any signal here?
Okay for example, you can
take the logits, right?
At any given point, you can
see the probability of a
word being generated, right?
By just running it through Llama.
And so you might be like, okay,
is there any signal on that?
Do we see that actually suddenly the
probability gets really low when it
changes from one theme to another.
And the answer is, yeah.
A little bit, but not really.
It's hidden in the noise.
Think about it.
At the start of the next sentence, you could say anything, right?
The probability of the next token at the start of the next sentence, maybe it's really high, or maybe it's not that high, and you start with a "the".
But then after "the", you could say so many things.
The tree, the cat, the man, the person, the block, the paper, whatever, right?
And that was all noise.
And it just threw the whole thing off.
You couldn't find the semantic changes within the noise.
That's just how chaotic language actually is.
But it's in our appendix.
If anyone's curious and there's
some cool pictures in there.
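For anyone who wants to poke at the same signal, here is a sketch of the logit probe described: run text through a causal language model and collect the log-probability it assigned to each token that actually came next, then look for dips near topic changes. As Brandon says, the signal is mostly buried in noise; the model name here is just an illustrative assumption.

```python
# Per-token log-probabilities from a causal LM, for probing whether topic
# changes show up as low-probability next tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def next_token_logprobs(text: str, model_name: str = "meta-llama/Meta-Llama-3-8B"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                 # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    actual = ids[0, 1:]                            # the token that actually came next
    return logprobs[torch.arange(actual.shape[0]), actual]
```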
Nicolay Gerold: Yeah, I've
tried something similar before.
We basically tried to model the
probability distribution of the logits
Brandon Smith: Oh, really?
Nicolay Gerold: It was clunky.
I had to make so many different assumptions and stuff like that, and it didn't work so well.
I think the logits aren't really optimized for themes.
They're optimized for a single token output, not for modeling the entire sentence, for example.
Brandon Smith: Yeah, exactly.
Yeah, that's it.
Because as well, it very quickly locks back in, right?
The moment the sentence starts getting onto something, immediately it'll just lock into something, right?
And so it was hard to see.
This idea that there'll be high entropy tokens, where it gives probability to tons of different tokens, no, it doesn't.
It very quickly locks into some theme.
But yeah.
Nicolay Gerold: Yeah.
If you had had infinite time and resources at the end of the study, what chunking method would you have loved to try?
Brandon Smith: I would have liked to think longer and harder about the very fundamental first thing we said, which is that the perfect chunking algorithm would take a query and find the relevant tokens in the document that actually answer that query.
But ultimately, that's what matters the most.
You just want the right stuff for the query.
Yeah, I would have liked to work on that.
As I said, ColBERT has some work on that, and there's something you could do there.
Maybe it's a sliding window technique where you slide along; that's how it works anyway.
Yeah.
But even just thinking about it now, it's putting my head into a spin.
There's a lot to think about, because a lot of the time the information is disconnected.
Again, if the question is, can you get me the names of every experiment in this paper, you have to retrieve a bunch of disconnected pieces, versus, can you give me the third sentence of the first paragraph.
That's a very weird question, but someone could reasonably ask that.
All these different things.
But yeah, I think it'd be nice, and it's very important that one day, and this will happen eventually, we have a chunker that can do that.
Well, not a chunker, a whole RAG system that can really retrieve the exact tokens, because that's what fundamentally matters.
That's something I would like to look into, but it's just hard to see how you can tackle it.
Nicolay Gerold: Yeah.
And I think this goes more into the entire search literature, the traditional information retrieval literature.
They actually spend way more time thinking about what different query types you will encounter, what documents you have, and which documents fit which query type, and then you basically try to deconstruct the entire search system and ensure that it matches in the most optimal way.
And I think, yeah, in RAG, we basically threw all of that away.
We embed the documents.
We embed the query, and we hope something good comes out of it.
Brandon Smith: Yeah, right?
Yeah, no, exactly, you're right.
And it's true, it's a massive, it's
a massive field that people have
worked in for a long time, right?
And, yeah, I think it's true,
we really should spend more
time revisiting old literature.
Nicolay Gerold: Yeah.
You already mentioned that you threw
away one chunking algorithm or method
you tried because of the language.
Are there any additional language
specific considerations for the
chunker people should be aware of?
Brandon Smith: Let me think.
Generally, here's what I'd say.
With most chunking algorithms, even the ones we threw out, the issue was when we had a mix in the same corpus: we had Japanese appearing amongst a bunch of English text.
And the issue with that is that, and I know these embedding models try not to do this, but it seems like they still treat it a bit separately.
Fundamentally it's true: a Japanese word is very different from an English word.
Even if the words are about the same topic, there's still one axis, language, where they're very different, right?
Take whatever the Japanese word for tree is, and the English word for tree.
Yeah, they should be somewhat similar.
They should be closer together than, I don't know, tree and car.
But at the same time, they're not the same.
They are words in different languages.
So in that sense, if there is a dimension in the embedding model which is the language, along that axis they're very far apart, right?
So anyway, yes, you will find that there's going to be some disconnect in similarity if you have multiple languages in the same document.
I think if it's all the same language, none of these problems should appear.
If it's all the same language, then it's fine.
That issue is eliminated.
Yeah, it's true, I haven't actually tested our cluster semantic chunker across different languages.
I think the LLM chunker would do just fine, because it's a large language model and it's a smart machine.
Nicolay Gerold: Yeah, a thought experiment of mine from a while ago was thinking about trying to reduce the different words of Latin based languages back to their Latin equivalents and training an embedding model on those Latin equivalents.
This is something I would love to try at some point.
Brandon Smith: That'd be
Nicolay Gerold: Because I think the embedding would get more robust across the different Latin languages.
Take any of those, Spanish, English, German; they all have heavy Latin influences, and if you trace back the origin of the words and embed that, it could be an interesting thing to try.
It probably wouldn't be worth the effort, but it would be a fun research project.
Brandon Smith: That would be fun.
Yeah.
It's the philanthropic kind of thing you do when you make tons of money.
Nicolay Gerold: Yeah.
When you look ahead, or during your research, did you find any research areas that you think are very promising or very interesting for the future of document chunking or RAG systems?
Brandon Smith: I think we're happy with the chunkers; the cluster semantic chunker is good, the LLM chunker is good.
They can be a little bit slow, though.
I think, fundamentally, to be honest, this is a pretty closed subject, right?
This is chunking, and it's the first part.
I think we shed a good bit of light on that.
I think really, though, what it highlights is that people should think about chunking.
I think somebody, I'm hoping someone
out there can start thinking about
the ideal token retrieval, right?
That's really the best thing.
That would be a future line of research.
We haven't actually made a dent in that, because it's fundamentally very different: given the query, can you retrieve the exact tokens in the document?
How can you do that?
I don't know.
There's a lot you could do, but
it's just very different from
what everyone's been doing now.
And as well, the entire vector database kind of goes out the window when you do that, right?
I'm not really sure; it depends how you implement it.
I'm not sure, but it'd be very tricky.
That's, I think, a very
promising line of research.
How you could tackle it, I'm not sure.
I think it fundamentally
needs to be done, and it will
eventually be done at some point.
Nicolay Gerold: Yeah.
And if people want to basically get
in touch with you, read the research
papers, hire you, hire Chroma, where
can they get in touch with you?
Brandon Smith: Yeah, that's true.
You can get me on LinkedIn, Brandon Abreu
Smith, or go to my Twitter actually.
What's my Twitter name?
Brandon Starxel that's
Brandon S T A R X E L.
It's a bit random.
That's just my code name on there.
Yeah.
You can get me on either of those.
I love it.
Come like my LinkedIn posts
Nicolay Gerold: So what can we take away when we are building RAG applications?
So current chunking methods
assume text to be continuous.
Which is a correct
assumption in most cases.
Text flows naturally, ideas build
on top of each other, sentences
combine into ideas, and words
link together to form meaning.
But most chunking methods actually break this flow.
Take the example Brandon gave: John buys a red Ferrari. It is the fastest in town.
When you split by sentence, you completely lose the connection, and especially the second sentence loses a lot of context.
What does it even refer to?
And current solutions try to
patch it by overlapping chunks.
So basically, if we split by sentence,
we might include the sentence
prior and the sentence after.
And hope that the context is maintained.
But this brings a lot of its own problems.
First of all, we store the
same text multiple times.
We have to use more compute power.
And also, it makes our
database messy with duplicates.
You might only retrieve chunks because of the included overlap, and not because the chunk itself, without the overlap, would actually contain something meaningful and important for the user's query.
And also, we have context that
could be completely out of the flow.
Of course, it can be in the chunk
above, but it could also be in a
different document, or it could
actually be in the head of the
person who has written the text.
So overlap is really an
imperfect solution for trying to
contextualize the information.
And we actually need better solutions
to do this contextualization
and also to do the chunking.
And the solutions for that could be
something like contextual retrieval.
We could use LLM based chunking, which is, in the end, a form of contextual retrieval.
Or, as a quick and dirty solution, you can use the overlap, but you should be careful with that: if you implement it, make sure that the context is actually in the flow and is not coming from somewhere else.
And one interesting thing, which is my theory for why the chunk size of 200 to 400 tokens works: most writing guides, and most writing follows this, say that a paragraph is 100 to 250 words and that one paragraph handles one idea.
And this really matches well with what the chunking research actually says about which chunk sizes work best, like 200 to 400 tokens.
And I think this is actually the sweet spot, because if you treat a paragraph as a unit of thought, it really lines up: you take a chunk size which aligns with that unit, and it just makes it more likely that you have a semantically coherent and contextually relevant chunk.
When the chunks are too big, they dilute meaning, and when the chunk size gets too small, you break up the context so much that you lose the information.
The two main solutions he came up with are, for one, semantic chunking.
In semantic chunking, you split the text into smaller segments, I often like to use sentences, and then you create embeddings for each segment and measure how similar segments are to each other.
And then you group related segments into coherent chunks.
And you could do that unstructured, so basically independent of the order of the segments.
What I like to do is actually use a rolling window.
So usually I have a minimum chunk size and a maximum chunk size, and I slide over the document and measure the similarity.
And once the similarity to the next sentence is below a certain threshold, I say, okay, we have to create a new chunk.
I build on the assumption that the author spent some thought on the order of the document and that it's not completely independent.
And through that, I basically try to find natural topic boundaries in the text: when one topic ends and another begins, I create two different chunks, and I can avoid a little bit of the overlap issue.
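A minimal sketch of that rolling-window idea, under my own assumptions about the details: sentences are walked in order, each new sentence is compared against the centroid of the current chunk, and a new chunk starts when similarity drops below a threshold, subject to minimum and maximum sizes. The embed function, the threshold, and the size limits are all placeholders to tune.

```python
# Rolling-window semantic chunking over sentences in document order.
import numpy as np

def rolling_window_chunks(sentences, embed, threshold=0.6,
                          min_sents=2, max_sents=10):
    vecs = embed(sentences)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    chunks, current = [], [0]
    for i in range(1, len(sentences)):
        # similarity of the next sentence to the centroid of the current chunk
        centroid = vecs[current].mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        sim = float(vecs[i] @ centroid)
        boundary = sim < threshold and len(current) >= min_sents
        if boundary or len(current) >= max_sents:
            chunks.append(" ".join(sentences[j] for j in current))
            current = [i]
        else:
            current.append(i)
    chunks.append(" ".join(sentences[j] for j in current))
    return chunks
```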
The second solution he found
was the LLM based chunking.
So you basically teach the machine
to think a little bit like an
editor and you use a language model
to read the text and decide where
it naturally breaks into parts.
So the LLM marks potential
breakpoints with a special token.
Then you ask the LLM to
evaluate these points.
And then you have basically a
decision which sections should stay
together and which should be split.
And you use these decisions
to create the final chunks.
And this really banks on the
idea that LLMs actually can
understand context and meaning.
And I think this method really has evolved into contextual retrieval by Anthropic.
Where, on top of the chunking, you add back in the contextual information.
So you basically have the entire document.
You look at the chunk, feed it into the LLM, and ask it, hey, what's missing here from the context of the document that I should insert?
And that's it.
When I use it, I don't just feed
in the prior document, but also
important definitions for the
field, which might diverge from
like the knowledge base of the LLM.
So usually you have different
definitions of something, depending
who you're asking or who you're
talking to, or what field we are in.
And you usually also include
these definitions to basically
make it contextually relevant.
We had a legal use case recently where we also included a lot of legal information, like different paragraphs and stuff like that, to contextualize it even more.
And through that, you try to create semantically distinct chunks, but still chunks that are completely self contained and contain all the information you might need to answer questions about that chunk.
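As a hedged sketch of that contextualization step, loosely modeled on Anthropic's published contextual retrieval recipe: for each chunk, show the model the whole document plus any domain definitions you care about, ask it for a sentence or two of situating context, and prepend that context to the chunk before embedding. The prompt wording and model name are assumptions.

```python
# Contextualize a chunk with the surrounding document (and optional definitions)
# before embedding it.
from openai import OpenAI

client = OpenAI()

def contextualize(chunk: str, document: str, definitions: str = "") -> str:
    prompt = (
        "Here is a document:\n" + document +
        ("\n\nRelevant domain definitions:\n" + definitions if definitions else "") +
        "\n\nHere is one chunk from that document:\n" + chunk +
        "\n\nWrite one or two sentences of context that situate this chunk "
        "within the document, so the chunk can be understood on its own. "
        "Answer with the context only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    context = resp.choices[0].message.content.strip()
    return context + "\n" + chunk   # embed this contextualized text instead of the raw chunk
```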
And I think this is a
really interesting approach.
I think we are evolving into a future where the size of the information isn't the driving factor, but for that we also need better embedding models that can handle longer chunks.
And I think we have seen one come out recently, which is ModernBERT, which performs better on longer context windows.
And if we get there, we actually can be more flexible in the chunk length, so we can create chunks of different sizes, or with different size context windows, and focus more on making them semantically self contained and handling only one topic.
So to recap a little bit: the best practice for now is keeping the chunk size in the sweet spot of 200 to 400 tokens.
This is something we already heard in previous podcast episodes; I will link to them in the show notes.
And also, figure out a way to keep the chunks semantically coherent and self contained.
If you want to hear more about
that, I have had a recent
episode with Max Buckley.
We really went deep into
that on how we can do it.
So yeah, that's it for the episode.
If you liked it, let me know below.
Leave a review.
It helps us out a lot.
Otherwise, we will continue next week.
I wish you a happy new year.
I will catch you soon.
See ya.