Nicolay Gerold: You're not typically
dealing with clean data as in research.
You're not dealing with well-written, spell-checked queries. You're dealing with messy, real data: PDFs, Word documents, HTML.
And so far, we had to hand-build the pipelines to extract the text, clean it, and bring it into a format that's workable.
But often we are actually restricted to "workable" because you can never cover all the edge cases of how the data is actually represented.
And today we are looking at a way that could make all of that redundant, by helping AI look at documents just like we humans do.
This is Nicolay. I run an AI agency and am the CTO at a generative AI startup. Today we are back, continuing our series on search, and especially on embeddings. Today, we are learning from Jo Bergum.
Jo is the chief scientist of Vespa and, at the moment, one of the chief evangelists of ColPali, and he's working on making it practical for production. We talk about how ColPali works, how to fine-tune it, and how to apply it to your use cases. It's actually a banger of an episode, so let's do it.
Jo Bergum: What frustrates me a little bit is whenever I tweet about some language model or some development, people will come at me and say, yeah, but I tried this and it's shit.
When you're working on search systems, you know that we won't get every query perfect, right? We have more of a probabilistic system, which might not respond in the way you were thinking it would respond. So you think the response is shit, but when we are building search systems, we are looking over a broad set of queries: are we doing better for those?
Maybe we are doing worse for some.
There's this differential between head queries that are frequent, tail queries that are less frequent, and the torso somewhere in the middle.
And this frustrates me a little
bit when people come at me and
say, yeah, but I tried this model.
It doesn't work for
this case and that case.
And when I talk to practitioners that are actually in this industry, in search, or working on machine learning systems and so forth, they are more excited about the recent developments because they know that, yeah, they don't work perfectly, but you can fit them into some workflow.
We can do more now than just
three years ago or just two
years ago or just one year ago.
Yeah.
So there's definitely progress, but people that are somehow not used to this or don't have this kind of ML background, I think they expect these systems to be 100 percent right from the get-go, and you're not going to get that.
That's also one thing when I look at retrieval. I think that's a little bit of the frustration from people coming into the retrieval space, the search space, now: they're expecting that the system will be perfect from the get-go.
And then they'll start
looking for different methods.
Oh, I tried this method.
I read about it on Twitter.
And then, I'm trying retrieval.
I'm trying vectors.
I'm trying Colbert.
I'm trying this other one. So they're running around trying these different methods, but they haven't got the basic evaluation methodology or way of thinking.
Yeah.
Nicolay Gerold: Yeah.
And I think that's a good prelude, basically: the user perception and what you're building, and that you're trying to map those together. Can you maybe go into why we actually want to map data into dense representations, or in general into representations that are numeric?
Jo Bergum: Yeah, there's a great paper from Jimmy Lin on a conceptual framework for why we want a representational approach in information retrieval.
So the idea is that in a retrieval system,
you have some kind of collection of
documents that you want to search through
and retrieve information from, and you
have a user with an information need.
And the user expresses
this information need.
Then we can have the system essentially
interact with each of these
documents in the collection.
If you have some kind of magic model,
let's say, that can estimate, if this
document is relevant for this query,
then we can basically iterate over all
the documents in the collection, invoke
this model, and the model can say if
it's relevant for the query or not.
And we can do this for
all of the documents.
But the problem, of course, is that
this is rather compute intensive.
If you want to build a system that is responsive and that maybe also has a low cost.
For instance, at Google scale, they
cannot spend a lot of compute per query
because they have a lot of documents.
And this is where you turn to a representational approach, so that you can map the query text and the document into some kind of representation, matched with a similarity metric, so that you can perform a search in sublinear time, meaning that you don't have to consider every single document in the collection and invoke this kind of magical model.
So that's the whole premise of
the representation approach.
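To make that premise concrete, here is a minimal sketch of the representational approach, using a small off-the-shelf embedding model; the model name and the toy corpus are illustrative assumptions, not anything from the episode. Documents are mapped into vectors once, offline, and at query time only a cheap similarity is computed instead of invoking an expensive model per document.

```python
# Minimal sketch of the representational approach; model and corpus are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Vespa supports hybrid retrieval with BM25 and vectors.",
    "ColPali encodes screenshots of PDF pages into patch vectors.",
    "BM25 is a strong unsupervised baseline for text search.",
]

# Offline: map every document into the representation once.
doc_vecs = model.encode(docs, normalize_embeddings=True)

# Online: map the query into the same space and score with a cheap
# similarity metric instead of invoking an expensive model per document.
query_vec = model.encode(["what is a good search baseline"], normalize_embeddings=True)[0]
scores = doc_vecs @ query_vec          # cosine similarity (vectors are normalized)
print(sorted(zip(scores, docs), reverse=True)[0])
```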
So that's a lot of words to describe what most people would just think of as, oh, I need search, I'm gonna go with Elasticsearch, gonna index my documents. It will tokenize them, invert them, and so forth.
But that's one type of representation.
Then you have a vector database.
Oh, they're dealing with a
dense vector representation.
They're dealing with vectors.
So I input my vector. But they are all representational approaches. And the idea is that you can do this to scale the system and make it cheaper.
And he also touches on, in the paper, supervised representations versus unsupervised representations, basically whether the representation is learned or not.
And if you look at the text embedding models that people are maybe familiar with, OpenAI's Ada and so forth, they've been taking text and documents and running them through a transformer-based model or some kind of model. And then they learned this representation from relevance data, where you have: for this particular query, this document is relevant; for this particular query, that document is irrelevant.
And using that training data, you learn
representations, and you learn how to
map these queries and documents into
this vector space, so that the relevant
documents are closer in this space, and
the irrelevant documents are farther away.
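As a rough sketch of how such a representation is learned, assuming an in-batch-negatives contrastive (InfoNCE-style) loss; the encoder and shapes here are placeholders, not the recipe from any particular paper:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_vecs, doc_vecs, temperature=0.05):
    """query_vecs[i] should end up closest to doc_vecs[i]; all other
    documents in the batch act as negatives."""
    q = F.normalize(query_vecs, dim=-1)
    d = F.normalize(doc_vecs, dim=-1)
    logits = q @ d.T / temperature        # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))      # the diagonal holds the positives
    return F.cross_entropy(logits, labels)

# e.g. q_batch, d_batch = encoder(queries), encoder(documents)
# loss = contrastive_loss(q_batch, d_batch); loss.backward()
```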
You can do this also with sparse representations. We've seen a model called SPLADE, which also uses a transformer model to learn the representation, but instead of having a dense representation where all the dimensions have a value, it essentially zeroes out a lot of the dimensions and you get a sparse representation.
And, yeah, there are a lot of variants around this, but if you compare it with BM25, for example, which is also a text scoring function or representation, there you don't have a learned representation, even though it is also sparse.
In one way it has a learned
representation because it has seen
all the documents, it knows all the
words in the collection, it knows the
kind of frequency, so it's more like
a statistical analysis of your corpus.
But still, since it has very few hyperparameters, we call it an unsupervised method.
So I really recommend reading that paper, because I think it gives a very good overview and also puts vector databases versus search engines versus whatever into context.
Nicolay Gerold: Yeah, and especially the separation between the physical retrieval model and the logical scoring model, which basically gives you this nice two-by-two of dense or sparse and supervised or unsupervised.
Do you think the dense
unsupervised area is a little
bit under explored at the moment?
Jo Bergum: Dense unsupervised, in one way... a lot of these models have been trained on some generic data like MS MARCO, which is this huge relevancy dataset from Bing. There aren't that many available datasets to train these types of models.
Most of the ones that you find on
Huggingface have been trained on this
dataset and you can generate data
synthetically, or if you have data
internally, like what you're building,
you can also train your own text
embedding models, but I think most
people actually use them off the shelf.
And then they're hoping that the learned representation from MS MARCO will work for their particular search use case.
And even though these embedding models are getting better and better, they still might struggle in what I call the out-of-domain setting, meaning that your use case for search or retrieval might not match what they were trained on.
And there's a paper or a benchmark called BEIR, the BEIR benchmark, I'm sure you've touched on this, which is a collection of different datasets where you can take generic, unsupervised models, apply them to BEIR, and then see how they perform.
And in the first rounds, I think the models that represent text as a single vector, like a regular text embedding model, were actually underperforming the traditional baseline of BM25.
In a lot of my work around search
and retrieval, when I'm writing about
this, I try to include the baseline.
And BM25 is actually a pretty good baseline for the minimum you can do when you're building a search product. I would say, okay, let's start with the baseline, and the baseline is BM25.
Yeah.
Nicolay Gerold: And when you look at the current paradigm of search, which is more like extract, chunk, and embed: what is in the dense representation, and what is the information that gets lost? Why might fine-tuning of the embedding models still be necessary?
Jo Bergum: Yeah, I think it's actually at maybe a little bit higher level than that: what are the documents that you actually want to make searchable? Are they text documents? Are they PDF documents? Does the layout of the documents really matter? If a user is actually looking at some result for a query, actually looking at that file using his eyes, then things like where the text is centered on the page might impact whether he finds this document relevant or not.
And if it has figures or signs or graphics, that also might change our perception of whether this document is relevant, in addition to the context that the user is in and so forth.
But if you look at just text, like if you have already somehow magically converted everything into some clean kind of text, then of course how you actually go through this data and chunk and embed is going to impact the representations. Because with a regular text embedding model, you run the text through an encoder-style transformer model, which on the output layer actually outputs one vector representation per input token.
So it's actually not one vector,
but it's a vector per token or word.
And then the text embedding model basically applies a pooling operation or an aggregation operation to represent this as a single vector.
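A small sketch of that pooling step, assuming a BERT-style encoder from Hugging Face transformers (the checkpoint name is just an example): the encoder emits one vector per token, and mean pooling collapses them into a single vector.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
enc = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

batch = tok(["a long document about search ..."], return_tensors="pt")
with torch.no_grad():
    token_vecs = enc(**batch).last_hidden_state      # (1, num_tokens, dim): one vector per token

# Mean pooling: average all token vectors (respecting the padding mask) into one vector.
mask = batch["attention_mask"].unsqueeze(-1)           # (1, num_tokens, 1)
single_vec = (token_vecs * mask).sum(1) / mask.sum(1)  # (1, dim)
```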
And the historical reason for doing it this way is that the transformer or encoder models like BERT have a length limitation on how long an input they can actually accept.
But not only that, I think it's very
difficult to have a good representation
for high precision oriented search
when you are basically averaging all
the vectors into one vector, right?
Imagine a very long text. How do you actually summarize that text by just averaging all the words in it?
And that's also what we see with other types of models that learn representations, like Colbert, which learns a representation per word instead of trying to learn one representation for the full text.
Yeah.
Nicolay Gerold: Yeah.
I think one interesting part of the paper was also the quality, cost, space, and time trade-off they mention. Can you maybe explain the dimensions and give one or two examples of very different models or approaches?
Jo Bergum: Yeah, I think this is a huge topic, right? But if you look at the representational approaches first, the things that try to map things into a representation with some similarity function, where you can do at least a kind of candidate retrieval phase: there, of course, using a single vector representation per document is beneficial, right? Because then when you're doing a search you only need to do a rather simplistic dot product or cosine similarity, which on today's computers or GPUs is pretty fast.
Then you have models like Colbert, which retains a vector per word, and then you have more complexity, right? So you have both storage complexity and compute complexity. Those are the representational approaches. I'm skipping SPLADE for now; the sparse representation is also a possibility, and you can accelerate that. And you can also accelerate dense retrieval using various approximate nearest neighbor search algorithms and so forth.
Then you have cross encoders, and
the huge difference is that they are
inputting both the query tokens and the
document tokens, or the feature data in
general, into one model at the same time.
So those are typically
deployed as re rankers.
So you already had a candidate retrieval
phase, and then you're re ranking them,
and then you have full interaction.
And because, if we're talking in the transformer model context, they are quadratic in the input length, when you're inputting the full sequence of both the query and the document at the same time, you have more compute than when you're doing it separately, as you do with Colbert or with a regular text embedding model.
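As a hedged illustration of the retrieve-then-re-rank pattern described here, assuming a cross-encoder from sentence-transformers; the checkpoint name and candidate documents are examples only:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "what is a good baseline for search"
# Candidates would normally come from a cheap first-phase retriever (BM25, vectors).
candidates = [
    "BM25 is a strong unsupervised baseline for text retrieval.",
    "ColPali encodes screenshots of PDF pages.",
]

# The cross-encoder sees query and document together, so attention spans both:
# more accurate, but quadratic in the combined length and run once per pair.
scores = reranker.predict([(query, doc) for doc in candidates])
ranked = sorted(zip(scores, candidates), reverse=True)
```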
But it's a huge discussion.
I think one thing that frustrates me is that when people talk about this, they talk about speed, about latency. Is it fast enough? And the thing with search systems is that you can parallelize a lot of the work, so that, for instance, on a CPU architecture, when you're doing retrieval and ranking, you can actually shard this work over multiple CPU cores and get the latency down. That doesn't mean that the cost is actually decreasing, right? Because you're increasing the cost. You're using more resources.
If I'm asking someone to paint my house, one company says it's going to take one day and the other company says, ah, it's going to take four days. If I then just look at the latency: the company that says it can do it in a day can throw 10 people at it, right? But it's going to cost me more in terms of money than the other company that might use fewer resources.
So I think our human perception of latency is that we think about speed, how fast we can run. We're not used to thinking about the fact that the workload can actually be parallelized. So that's one of the things that frustrates me a little bit when I get questions like, can it be fast?
And yeah, a lot of these
things can be fast, right?
So when you look at searching in
Google, you're searching and spending
actually quite a lot of compute.
But it's fast because they parallelize everything.
Nicolay Gerold: Yeah.
And I think most people don't really think in terms of trade-offs, but rather in binary terms. And you have to think about your application, what you are optimizing for, and that this optimization will come at a cost of something else. If you're going for more quality and at the same time more speed, you will have more cost and likely also need more space.
Jo Bergum: yeah.
Nicolay Gerold: And the different representations, how do you think about them? Do you think about them in terms of different search signals as well?
Jo Bergum: I think in the discussion of these different dimensions, especially if you're new to this field and you're trying to build something with search, instead of thinking about how can I make this efficient or fast or whatnot, think about what is actually improving the results of what I'm doing, right?
And that comes back to our earlier
discussion about evaluations.
So you actually can quantify,
is this better than this, right?
And if you haven't got users and you haven't got the scale, meaning a lot of query traffic or a lot of documents, you can probably get away with using a more expensive model in any of these dimensions, right? Because you can still get the latency down to have a good user experience.
But as your service becomes more popular and you want to bring in more signals, you also have interaction data that you can start working on and use to move to a maybe less resource-intensive method, right?
So let me be more precise.
If you start out with a cross encoder,
you can make this work, pretty fast
today with modern compute and so forth.
It's expensive.
Okay, it is, but you don't have a lot of users at the time, and if you can quantify that this is actually improving the result quality by 30%, then I think you should do that.
And then gradually, if your service becomes popular, you have interaction data. You know what people are searching for, you know what they're clicking on, etc. Then you can start looking for cheaper alternatives, to kind of distill the knowledge of that expensive model into a more practical thing, right?
And that's essentially what Google and others did as they had to scale out their infrastructure.
So don't cosplay Google if you're
not operating at Google scale, right?
Nicolay Gerold: Yeah.
And I think that's really interesting because it has become so easy to use really complex models that we can start out with a more complex application and then basically slim it down as we gain more experience and more data, which allows us to train more complex stuff.
Jo Bergum: Yeah.
But then again, if it doesn't actually improve the result, then you're just throwing compute and complexity and resources out the door.
So it goes back to having
that kind of evaluation.
And that goes back to what I said: my one advice would be to actually take a course in how you evaluate these systems.
Nicolay Gerold: Yeah, summarized in one or two sentences: have a test set, have a strong baseline like BM25, run new approaches against it, and see whether they actually give you bang for the buck when you introduce something new.
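A minimal sketch of that evaluation loop, assuming you already have a small test set of queries with judged relevant documents and two retrieval functions to compare; all the names here are illustrative:

```python
def recall_at_k(retrieve, test_set, k=10):
    """test_set: list of (query, set_of_relevant_doc_ids); retrieve returns ranked doc ids."""
    total = 0.0
    for query, relevant in test_set:
        top_k = retrieve(query, k)
        total += len(set(top_k) & relevant) / len(relevant)
    return total / len(test_set)

# baseline_score  = recall_at_k(bm25_search, test_set)    # hypothetical baseline function
# candidate_score = recall_at_k(dense_search, test_set)   # hypothetical new method
# Only adopt the new method if candidate_score meaningfully beats the baseline.
```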
Jo Bergum: Yeah.
Nicolay Gerold: And in the paper, they also mention creating new representations through the densification of something like inverted indexes, or the opposite, sparsification of dense vectors. Do you see potential in these approaches, or any new research directions that have already opened that up?
Jo Bergum: Yeah, personally I'm not that into that, because when you get into these details, it's more about how you accelerate and reduce the cost of a system. And what we know today is that dense representations and nearest neighbor search algorithms, like HNSW, inverted file, there's a ton of them, have some really good properties in terms of scaling, at least on the query side, compared to sparse representations.
We didn't touch so much on sparse representations, but when you're using a learned sparse representation model like SPLADE, what it essentially does is expand both the query and the documents. Yes, there are variants here, but in essence it expands the query and the document, so that you have more tokens or words with a non-zero weight.
That complicates the process of doing accelerated retrieval using WAND or block-max WAND and so forth over inverted indexes, like you will find in Vespa, Elasticsearch, OpenSearch, and so forth.
And they have a little bit different properties in terms of scaling. They are slower, or they are more compute intensive, if the representations aren't that sparse anymore, right? When you're expanding the sparse representation, an original 3-word query becomes a 56-word query instead. Then it's not so fast or compute efficient anymore compared to dense representations.
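For illustration, a hedged sketch of the expansion Jo describes, using a SPLADE-style checkpoint from Hugging Face; the model name is just an example, and the pooling follows the commonly used SPLADE formulation (max over tokens of log(1 + relu(MLM logits))):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

name = "naver/splade-cocondenser-ensembledistil"   # example checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

batch = tok("cheap flights to oslo", return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits                  # (1, num_tokens, vocab_size)

# Sparse vector over the whole vocabulary: most dimensions stay zero, but many
# more terms end up non-zero than the four words in the original query.
weights = torch.log1p(torch.relu(logits)).max(dim=1).values.squeeze(0)
expanded = {tok.decode([i]): w.item() for i, w in enumerate(weights) if w > 0}
print(len(expanded), "non-zero terms")
```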
But personally I love vector search and vector representations because they are flexible. Still, we have to keep in mind that using just an off-the-shelf text embedding model might not work that well compared to BM25 in the out-of-domain situation.
But I'm really into dense
representations, yeah.
And the other great aspect about dense
representations is that it's very
easy to move them around in space.
For example, for personalization
and others, you can have
adapters on top of these vectors.
They are learnable. They're differentiable. Compare that to a lot of the sparse technologies and sparse methods. Yeah.
Nicolay Gerold: Yeah.
And one thing we have only touched on tangentially is that you actually have to match the representations of the query and the documents, which often differ a lot in language or in how they are formulated. How do cross encoders, or Colbert, make that easier?
Jo Bergum: Yeah.
So when we are representing things, let's take the dense representation as an example, where you're using a single vector on both ends. We call those bi-encoder or text embedding models. You're not interacting directly with tokens, like you're not actually matching this token against another token in the document. You're only interacting through this cosine similarity or dot product in the vector space. So you no longer have token interaction.
But in the cross encoder, a model
which actually inputs both the query
and the document text and all the
tokens at the same time, they will have
attention on the token level, right?
So for each of the tokens in the
query, you will have attention to all of the document's tokens.
And the Colbert model is actually a bi-encoder, but it has the interaction late. So it's a late interaction model, where you have token-level interactions. And many believe that these token-level interactions, even if they are done late, at scoring time, mean that type of model generalizes better to new domains or new text.
Because let's face it, what you actually typed does matter, right? If you typed Coke, you might want to match a document that has Coke; you might not want to match Pepsi even if it has a similar semantic meaning, right?
Nicolay Gerold: Can you maybe
quickly define like the difference
between early and late interactions?
Jo Bergum: Yeah, I think late interaction was introduced by the Colbert paper by Omar, in the sense that the query and document tokens are not interacting when we are encoding them with the transformer model, right? That enables us to encode the document corpus up front, offline, so we can encode the document collection, and at query time, which is the late time, we can have that type of interaction between the two using this multiple-vector representation. So that's where the late part comes in.
The early sense would be more like a cross encoder, where you're interacting throughout the transformer model within the same inference call, and each of the tokens will interact. When I talk about cross encoders, this also applies to using regular LLMs, like RankGPT or things like that, where you're using more of a generative model and you basically stuff the prompt with all the documents and ask the model to rank them. There you also have interaction throughout the decoder network between all the tokens.
But Colbert enables us to have this kind
of interaction with the tokens with neural
representations or vector representations.
Nicolay Gerold: Yeah.
And now moving on to the actual topic of the episode: ColPali. How was the late interaction model introduced to the visual domain with ColPali?
Jo Bergum: Yeah, so ColPali takes the late interaction scoring mechanism from the Colbert paper, the Colbert approach, and combines that with a visual language model, PaliGemma from Google. So it uses PaliGemma to obtain a multi-vector representation of a screenshot, where you essentially take a screenshot of a page, for example a PDF page, and that page is divided into grid cells, and each of these grid cells is represented as a vector.
Like in the text domain with Colbert, where you represent each of the text tokens as a vector. And they are mapped into the same vector space where you're also matching text tokens, because it's a vision language model: it combines a vision encoder with a language model, so they have learned joint representations of both text and image data.
And similar to Colbert, you do the same thing with the query, which is text only, or it could also be an image. But in the ColPali paper, they only touch on having a text query and some documents you want to search in. And then you do the late interaction or the MaxSim scoring, which essentially computes a matrix multiplication.
Nicolay Gerold: Yeah, and what I was always curious about, and would love to hear your intuition on: is a certain type of language more suited for that retrieval? If I use a very visually descriptive language, might it have a better chance of retrieving the relevant parts of the document?
Jo Bergum: By language, do you mean German or English, or do you mean within a particular language?
Nicolay Gerold: No, really in how I phrase the query, not what language I'm using.
Jo Bergum: Yeah, I think, like any system, this will depend heavily on the data it was trained on. So ColPali is more of a direction. It hasn't been trained on a lot of data, but I see it as the most exciting development in retrieval. Because when you're building these systems, you're not dealing with the clean, pre-processed queries and text that you're given up front in an information retrieval benchmark, where you're dealing with clean text and clean queries: here's a clean document, is the document relevant or not?
But in industry practice, they first need to take the documents that they want to search in and convert them into some kind of format, with text extraction and whatnot. And maybe then you're doing this chunk-and-embed and so forth and vector search and whatnot. The whole process is quite complex and error prone.
And in a lot of cases, these kinds of steps are introducing new errors. You're not extracting all the elements of the document, you're not extracting the figures, you're not extracting the charts. The layout is totally off, you don't know how this page is actually rendered, like with HTML or those types of formats.
And the refreshing thing about ColPali is that it builds on this development where we have vision language models with really strong OCR capabilities. They're not just good at searching for cats or things like that, but they're actually good at looking at text and figures in documents. And they build on that, they build on PaliGemma, and they say, okay, what's a good way to take the approach that we saw worked really well for the text domain, which is Colbert, and use it with this new capability to do retrieval. I think that's so refreshing, and that's why I've been so excited about this new type of model.
Nicolay Gerold: Yeah.
And the industrial application, I think, also banks on vision language models becoming better, so we can actually extract the information out of them.
Jo Bergum: Yeah, if you're not familiar with this, I really recommend using whatever service provider, like ChatGPT or any of these that allow you to upload an image, and asking questions about it. I think the progress over the last year is really amazing. And people will probably come at me and say, yeah, but this case didn't work, it didn't extract the text perfectly.
But I think there's amazing progress here, especially around PDFs and complex formats. And even HTML, right? Because HTML is a mess when you're trying to parse it. You need to render it in order to actually see how it displays and where the actual text is centered.
And if you look at relevancy judgments
and so forth, if you're actually judging
if this HTML document is relevant
or not, the user is looking at the
screen, like how it's rendered, right?
And the position of the text might impact whether you think this is relevant or not.
So instead of going through these messy text representations and extractions, just take a screenshot of the page, feed that through PaliGemma or any other of these vision-capable models, and then have a representation which you can use for retrieval. I think that's pretty amazing.
Nicolay Gerold: Yeah. And what people tend to forget is that OCR itself is so messy. Extracting information from PDFs has always been messy and also very costly, because you tend to extract so much additional stuff from PDFs that in most cases you don't need. When you can find it, identify it, and then extract it and use it, whether in a batch or a streaming process, it saves you a lot of effort.
Jo Bergum: Yeah. And I also like the aspect that it's, in a way, how we as humans look at the page, look at the content, through our eyes, right? The vision capabilities. And then we interpret this and we take the words into account. I'm looking right now at a sign, like a stop sign, where I can clearly see that cars cannot travel here, but there's also text on it. It says it doesn't apply to buses or taxis. So there's both the vision and the text there. And ColPali combines these two into one representation that you can use for retrieval.
Of course, people ask me, yeah, but what's
the difference between this and using a
vision capable language model like GPT 4?
The difference is that for GPT-4, you would have to take all the documents and upload all the documents and all the screenshots, and then you would ask it the question, or the information need, right? So it would be more like the cross encoder that we discussed, because you would have full interaction.
What ColPali does is that it
learns or brings a representation
that we can use for retrieval.
So let's say we have 1 million PDFs.
You might not want to upload
that to GPT 4 for every query.
This enables you to search across this data, and then you can use a larger model for the re-ranking or extraction phases or question answering, right? In the context of more of a ranking pipeline.
Nicolay Gerold: Yeah, I think it's the same discussion as RAG versus long context. It's just wasteful and it makes the LLM less reliable when I feed in lots of stuff that likely isn't relevant to the query.
Jo Bergum: Yeah, but I think
it's a good discussion to have.
Because essentially there was a lot of excitement about chat with your PDFs and whatnot.
And over the last year we went
from like 8k context lengths to
1 million or 2 million, right?
I think it's an important
discussion to have.
I'm not particularly a fan of the RAG versus fine-tuning debate, because I think fine-tuning is dead in the sense of large language models: I see the beauty of them in the capability they have to follow instructions across so many different types of tasks, right? Instead of dealing with fine-tuning, following recipes, errors, whatnot, just use a capable frontier model; the prices have come down.
The context windows are longer.
So I think it's an important discussion to be had around what actually changes with the increased context window.
And it means that you can do a lot more
now than you could before, but still
there are use cases where you want to
retrieve and do a retrieval phase before
you start stuffing things into the prompt.
Yeah.
Nicolay Gerold: Yeah. And maybe to go more into the technical details of ColPali: what is actually the importance of using, at the same time, the page-level MaxSim and the cross-page MaxSim in ColPali? And can you maybe also quickly touch on MaxSim, what it does and what it is?
Jo Bergum: Yeah. So let's touch on MaxSim, or the late interaction mechanism. Essentially, on the query side, for each of the query tokens you have one vector. And on the document side you have one vector representation per what they call grid cell or patch. So the image is divided into cells and you have one vector for each. Then, for each of the query vectors, you do the dot product with all the patch vectors or grid cell vectors, and you take the maximum for that query token vector. You keep that, you do the same for all the query token vectors, and then you sum these maxima, and that sum is the relevancy score of the document.
And ColPali is trained on single screenshots of PDF pages, not whole documents. So when I did a blog post about ColPali and how to represent MaxSim, I flexed a little bit on the capabilities of Vespa's tensor framework, which allows you to not only index one page as one Vespa document, but to represent the whole document in actually one Vespa document, so that you could do interaction not only per page but across the entire PDF.
But I think it might have been a mistake by me to introduce all this and flex all the capabilities, because people were like, what's going on here? In my upcoming blog post on ColPali, I trimmed it down so that we actually have one PDF page per Vespa document, to try to reduce the complexity, both in compute and also in what's actually going on there.
Yeah.
Nicolay Gerold: The one thing that I immediately had in mind as a limitation of ColPali is: how do I run validations? In text, I can run factuality checks and stuff like that, but in the visual domain, this is really challenging. What would you say on that one?
Jo Bergum: Yeah, I think it's an important aspect. What the authors did is they used capable frontier models, like GPT and some of the Anthropic models that also have visual capabilities. When you're retrieving documents and then evaluating, you use a more capable model to say whether a result actually is relevant. So more of a vision-language-model-as-judge process, like the one we've already seen becoming quite popular in the regular text domain, where you're using large language models as the judge, as long as you calibrate and so forth.
Calibrate the prompt a little bit so that it aligns with human preferences. So you need to have a baseline of human preferences, and then you align to that.
I think you can do the same thing
with a vision capable language model.
Nicolay Gerold: Yeah.
And same as the NLI models, which take two texts as input, you could do the same for a text and an image in the end. It would be an interesting thing to try.
What is a challenge that might arise when you scale ColPali up? Say you're in an enterprise setting and you have millions of documents sitting in data lakes. Is this possible right now, or what would it take to scale it up?
Jo Bergum: Yeah. So this is definitely possible. I always get this question: will it work for 1 billion documents or something like that? Which goes back to: you probably are not building out Google infrastructure right now, right? I can have ColPali do 1 million documents on my laptop. But ColPali and Colbert have several orders of magnitude more compute compared to a single vector representation, because they are not using just one vector to represent the entire page, they're using multiple vectors, and you're interacting with all the query token vectors. So there's a lot more compute involved in ColPali than in a regular one-vector representation of the object.
So what you would typically do to get low latency is parallelize this, so that you use more search threads in Vespa per query and get the latency down.
But for scaling to a very large corpus, you would need to divide ColPali into two phases. You would have to introduce a retrieval phase where you're retrieving candidates, and then you can re-rank using ColPali, either as the only signal or in combination with other types of signals that you want to use.
And one thing that surprised me a little bit with ColPali and the benchmark that they introduced, the ViDoRe benchmark, is that even when the documents were text heavy, without a lot of figures and so forth, it was still beating, not by a large margin, but beating very good baselines like BM25 and M3 with chunking and multiple vectors. Which really demonstrates the capabilities of the language models that are paired with vision capabilities.
Yeah.
Yeah.
Nicolay Gerold: Do you think we can still improve this? Is the ColPali approach of splitting it up like naive chunking, where I'm just going by length? Can we figure out a smarter approach to do that?
Jo Bergum: Yeah, I'm lucky that I'm
not so smart that I'm working on neural
network architectures or things like that.
But what we can do with ColPali, that we know works pretty well, is two things. We can reduce the number of vectors that we represent on the document side. So instead of having a thousand vectors per PDF page, we might be able to reduce it to 300. We tried this in our internal evaluations of ColPali, and the authors have now also tried this, and the accuracy drop is very small.
And that's the good thing with having a benchmark, or having your own relevancy dataset: if I do this tweaking, how will it impact the quality? So that's one thing. You're essentially reducing the cost by 3x; on a single thread, it's three times faster.
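One simple way to get that kind of reduction, sketched under the assumption of a 32x32 patch grid: pool neighbouring patch vectors. This illustrates the idea only and is not the exact scheme used in Vespa's or the authors' experiments.

```python
import numpy as np

def pool_patches(patch_vecs, grid=32, factor=2):
    """Mean-pool non-overlapping factor x factor blocks of the patch grid."""
    dim = patch_vecs.shape[-1]
    grid_vecs = patch_vecs.reshape(grid, grid, dim)
    pooled = grid_vecs.reshape(grid // factor, factor,
                               grid // factor, factor, dim).mean(axis=(1, 3))
    return pooled.reshape(-1, dim)     # (grid/factor)^2 vectors instead of grid^2

page_vecs = np.random.randn(1024, 128).astype(np.float32)   # illustrative page vectors
print(pool_patches(page_vecs).shape)                         # (256, 128): 4x fewer vectors
```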
Then the other thing, and that's something I've been actively working on over the last week, is to accelerate the scoring. So instead of doing a floating point dot product, we can instead look at a bitwise Hamming distance. And we have some interesting results there. This speeds up the actual process quite a lot compared to float representations.
And then you have to go back and look at, okay, how is that impacting quality? With floats you have the full range of each dimension, but most of the values are just around zero. So instead, you represent each dimension as a bit which is set or not, depending on whether the float value is positive. And this approximation works really well with Colbert, and it turns out it works really well with ColPali too. That's something I'm excited about sharing more about in the coming weeks, actually.
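A hedged sketch of the binarization plus Hamming trick: keep one bit per dimension (the sign of the float) and score with XOR plus popcount instead of a float dot product. Illustrative only, not Vespa's implementation.

```python
import numpy as np

def binarize(vecs):
    """Pack the sign bits of (n, dim) float vectors into (n, dim/8) uint8."""
    return np.packbits(vecs > 0, axis=-1)

def hamming_sim(query_bits, patch_bits):
    """Higher is more similar: total bits minus the number of differing bits."""
    diff = np.bitwise_xor(query_bits[:, None, :], patch_bits[None, :, :])
    dist = np.unpackbits(diff, axis=-1).sum(axis=-1)   # popcount per (query, patch) pair
    return (query_bits.shape[-1] * 8) - dist           # (num_query, num_patches)

q = binarize(np.random.randn(12, 128))
p = binarize(np.random.randn(1024, 128))
score = hamming_sim(q, p).max(axis=1).sum()            # MaxSim over binary vectors
```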
Nicolay Gerold: Nice, I'm
really interested in that.
How would you actually go about it if you come from an industrial setting and want to apply ColPali to other domains or other data types, like medical images, for example? What would be necessary to fine-tune ColPali on a new task?
Jo Bergum: Yeah, this is a great question.
I see ColPali more as a direction, a model snapshot in time that has really demonstrated to us a future direction for retrieval over complex formats. And if you take the checkpoint that was trained on PDFs and try to apply it to, say, medical image data, that's not going to work.
It's going to be like any other technique
that you're applying out of domain.
They do include how you actually fine-tune, how you go about it, in the repo that they published. So you would use that, but then you would have to mix it up with training data for your domain, for what you're actually trying to solve, to train a better checkpoint. It's not a general checkpoint that will work across a lot of different domains, but I personally see it as a fantastic direction for retrieval over these complex document formats.
Nicolay Gerold: Yeah, I think especially the medical use cases you can tackle with that are really interesting. Also, satellite images are interesting to approach with this. What are some of the more interesting use cases you've already seen implemented or tried yourself?
Jo Bergum: Yeah, personally, I've only looked at the checkpoint and the model within its intended use, and it's been trained on PDFs, so that's what my focus has been on. But I think this approach is also promising for web search, for HTML, because it's also a format that needs rendering to actually see what's going on. You avoid things like text that is hidden for SEO, spam, whatnot. You don't have the menu items. You can put weight on what's actually in the center of the page. In summary, all kinds of complex formats that really need to be rendered or printed, I think that's where ColPali will shine.
Nicolay Gerold: Yeah.
And I think we're starting
to close out right now.
What I always ask is basically
overrated, underrated.
What's an underrated technology?
What's an overrated
technology at the moment?
Jo Bergum: Personally, if we go underrated, meaning under-hyped, I would say the baseline, BM25. That's a good starting point for a lot of different retrieval aspects, so I think that's underrated. Overrated, in terms of retrieval: I think some of the 7-billion-parameter embedding models that are 0.2 nDCG points ahead of a much, much smaller model with much, much smaller dimensionality are overrated. Yeah.
Nicolay Gerold: And what is something that you would actually love to see built in retrieval or around it? What would make your life way easier?
Jo Bergum: Oh, I think it would be fantastic to see more ColPali model checkpoints and larger interest in the community, because it eliminates so many of the pain points around indexing and extraction. I really would like to see more ColPali-oriented models.
We see more and more vision-capable language models now. There is also Phi-3.5 Vision Instruct from Microsoft, with long sequences. I'd love to see someone use that to build ColPhi, or things like that. Yeah, I really believe in that direction.
So I'm hoping to see more of that
because it avoids so much complexity
on the document processing side.
Nicolay Gerold: Yeah.
And what's next for you?
What's next at Vespa?
What can you tease?
Jo Bergum: Personally, I'm so excited about ColPali that I'm all in on ColPali, and I'm also all in on how we accelerate MaxSim and late interaction. How can I address the questions you brought up in terms of scaling: what does it mean that we have this amount of vectors?
How can we accelerate that?
How can we make it practical?
How can we make it work
with 1 million documents?
How can we make it work with 1
billion documents and so forth?
Because I get a lot of questions around this, so that's the near term.
I also think there's raised awareness now that a single vector model, where you represent the entire object with a single vector, is not the greatest representation, because there are multiple aspects of a document. And then there are these kinds of hybrid capabilities of Vespa. That's something on the Vespa side that we are still investing a lot in, bringing this together into one platform.
So you can start with BM25 with
Vespa, and then you can move on
to more advanced cases as your
knowledge and also your evaluation
strategies and so forth increases.
Then you can actually start exploring with fancier techniques, if you can call it that.
Nicolay Gerold: Yeah.
And if people want to start
building the stuff we just talked
about, where would you point them?
Jo Bergum: I am biased of course, but I would like them to go to the blog at vespa.ai and read some of my blog posts, maybe check out some of the tutorials or the Python notebooks that we've done.
Hang out on Slack and yell at me on Twitter. I'm quite active on X, or Twitter, so you can yell at me there.
Nicolay Gerold: Okay, what can we take away? I think the excitement of having AI that can navigate and look at documents and pages and whatever, more like a human, is very interesting. We are basically moving away from having a large pre-processing pipeline for every type of document, and rather using something like ColPali to find the relevant information and then either using a VLM to get the information or directly answer questions about it, or running the post-processing or extraction pipeline only on the relevant parts. Which, especially if you have large corpora of documents, will save you a lot of time.
And to make it a little bit more tangible: what could you use this for?
What I'm working on at the moment is using it especially for annual reports. So long financial documents, where you have a lot of graphs, a lot of tables, but also a lot of text. And most of the time, for me, the interesting information is hidden in one of the tables or one of the graphs and not in the text.
So I basically need to find them and then extract the information. But in an annual report, there are often at least 20 different tables for different types of information. So I basically want to identify the relevant table, then extract it, post-process it, and use it further down the road.
And I think you can apply that to a lot of different areas. For example, medical records, government reports, but also legal documents, academic papers. I have a whole list of those: product catalogs, technical manuals. It's everywhere you have a mix of images, tables, and text.
You might be better off using something like ColPali, especially down the road as the vision language models get better, and not doing all the document extraction. But at the same time, when you have predominantly text and the information you need in your application is in the text, you should probably stick with the current paradigm and run an extraction, because text extraction with something like AWS Textract is pretty good.
One more exciting area is something Scott Belsky, I'm not sure what his exact position is, but he's at Adobe, mentioned: basically having a better onboarding experience for software products.
And I think using something like ColPali and vision language models could be very interesting for that, because these models are basically looking at the software product like a human would. So they might be more interesting for actually demonstrating the different use cases and especially answering specific questions the user has, or guiding him to his specific target. And why I think this is relevant: in software, it's important, when a user signs up for the product, to get him to his target as fast as possible, or to value as fast as possible.
And when you can guide him specifically based on the goal he has, this could be very interesting.
But yeah, for now I think ColPali has so much potential. Whether it's already there in terms of applying it the way embedding models are? I don't think so, because embedding models are pretty optimized already.
You have so many different indices which optimize the retrieval, and ColPali doesn't have that yet.
They're already working on it, especially Jo at Vespa with his bitwise Hamming. But we will probably see so much development that in the next two months or so, we will see some form of efficient computation over ColPali embeddings, and efficient storage.
And yeah.
I'm excited for it.
If you want to try out any of these things, or if you have an idea and want to get my feedback, just hit me up on LinkedIn, on Twitter, wherever.
And yeah, I'm really excited.
I will probably share some
guides of what I'm doing.
Maybe do a demo with ColPali of how I would use it, maybe even a few cookbooks or something like that.