Nicolay Gerold: When you store
vectors, each number usually takes
up 32 bits within the vector.
So with a thousand numbers per vector and millions of vectors, the costs really explode. A simple chatbot can cost thousands per month just to store and search through your embeddings in the database.
The fix is quantization, and you can think of quantization like image compression.
JPEGs look almost as good as raw
photos, but take up far less space.
And quantization does
the same for vectors.
The trade-off is that you lose some accuracy. But often that loss doesn't matter or is negligible, just like the slight loss in quality in a JPEG.
And today we're back continuing our series on search with Zain Hasan. Zain is a former ML engineer at the vector database Weaviate and now a senior AI and ML engineer at Together.
And we talk about the different types
of quantization, when to use them,
how to use them, and their trade offs.
Let's do it.
Zain Hasan: When people think
about vector databases, they think
about them with regards to ChatGPT.
So they think, oh, vector databases are a way to store my text data, chunk it, retrieve the relevant pieces, and give them to ChatGPT.
But vector databases can handle text, image, audio, and video data just as well.
Right?
If you have a machine learning model
that can encode that data, then a
vector database doesn't care because
the vector of an image is pretty much
the same as the vector of the text.
All of the mathematical computations
for that vector will be the same.
So going from text to multimedia in vector databases is really simple, as long as you have a machine learning model, and there are a lot of open source ones.
And then the second thing
I think is quantization.
So a lot of people think that vector databases are this really expensive thing, you know, like fine dining cuisine. You go to a restaurant and you have to pay a thousand dollars a pop or something like that.
A lot of people have this view in
the field that, oh, you're doing
vector search, it must cost a lot.
I don't think a lot of people
are using vector quantization.
Right.
If you're using the entire vector that comes out of the model, that can end up costing a lot, but oftentimes you don't need the entire thing. You can quantize it. And we can talk about this, but you can cut down your cost by 80 or 90 percent just by quantizing, depending on how much accuracy you want, right?
So there's a lot of tuning of cost, speed, and memory footprint that I don't think a lot of people are utilizing right now. It's still a new field, but I think those are the couple of things that a lot of people are surprised to hear.
Nicolay Gerold: What do you think, quantization-wise, what are the trade-offs you're actually making? Do you have a matrix you're looking at, or a different set of metrics?
Zain Hasan: Yeah.
So basically, when you're evaluating how good the vector search component of your RAG stack is, there are two or three things you're really looking at. You're looking at how fast this thing is, so how many queries you can serve. You could think of that as throughput. Another thing you're thinking about is the total memory footprint of your index. So if I've got a million objects in my vector database, how much in-memory cost does that amount to? And then you're also looking at recall: if you're passing in a query, are you even getting back relevant things? Usually you're balancing those three things. If you have an application where recall is the most important thing, then you might want to sacrifice latency and memory consumption to boost recall as high as possible.
Okay, so there's a lot of things that
allow you to kind of fine tune and
balance between those three things.
Vector quantization is one thing, right?
So let's say you have a thousand-dimensional vector. Every single dimension of that vector is a 32-bit floating point number, right? So every dimension is four bytes. And if you have a thousand dimensions, that's 4,000 bytes per vector that's stored, and then that scales up. If you have a million vectors, you've got a lot of in-memory requirements, right?
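To make the arithmetic concrete, here is a quick back-of-the-envelope calculation in Python, assuming one million 1,000-dimensional float32 vectors (illustrative numbers only):

```python
# Rough memory footprint of raw float32 vectors (illustrative numbers only).
dims = 1_000          # dimensions per vector
bytes_per_dim = 4     # float32 = 32 bits = 4 bytes
num_vectors = 1_000_000

per_vector = dims * bytes_per_dim                  # 4,000 bytes per vector
total_gb = per_vector * num_vectors / 1e9          # ~4 GB held in memory
print(f"{per_vector} bytes per vector, ~{total_gb:.0f} GB for {num_vectors:,} vectors")
```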
Vector quantization basically asks: are you willing to let go of some of those bits in order to reduce the memory footprint? If you can go from four bytes to four bits, or from four bytes to one bit, there's a whole spectrum of information you can get rid of. You're reducing the memory footprint, but at the cost of recall, because now your vectors are becoming fuzzy. You don't know exactly where they are; you're using less memory to capture the location of every vector.
And so people have found different techniques for what the right amount of information to get rid of is, how to get rid of it, and what your data distribution needs to look like before you apply vector quantization so that it still works afterwards.
And so there's a whole school of methods, product quantization, binary quantization, all of these different methods that people are now working with.
So you mentioned Mixedbread.
Cohere was also working on this.
I think binary quantization is something
that people are really excited about
now because it's kind of amazing to
think that you could have a thousand
dimensional vector and then you could
only keep one bit per dimension.
So it's almost like saying, let's say there was a party and you describe to me its location. Instead of giving me the full address, you just tell me, every time I come to a street, whether to go left or right, and that's all you tell me, consecutively, one after the other: left, left, right, right, left, right, left, right. You could do this and still capture vector distances pretty accurately, which is pretty amazing, right?
So I think people are correctly
excited about binary quantization.
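A minimal sketch of that sign-threshold scheme, assuming plain NumPy, with Hamming distance standing in for the original vector distance (real databases pack the bits and optimize this heavily):

```python
import numpy as np

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    """Keep one bit per dimension: 1 if the value is positive, else 0."""
    return (vectors > 0).astype(np.uint8)

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Count the dimensions where two binary codes disagree."""
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 1_000))       # five toy 1,000-dimensional embeddings
query = rng.normal(size=(1_000,))

codes = binary_quantize(docs)
query_code = binary_quantize(query)
print([hamming_distance(query_code, c) for c in codes])   # lower ≈ closer
```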
Nicolay Gerold: What are the other levers you could pull to basically move along the different metrics you just mentioned?
Zain Hasan: Yeah.
So think about a full vector: it's 32 bits per dimension, right? You've got all of that accuracy to capture exactly how far to move along the axis of each dimension. So if you have a 2D vector, you've got 32 bits to represent how far along the X axis and how far along the Y axis to move. You can exactly locate yourself in a 2D or 3D space, like a Cartesian coordinate system; you can exactly locate the vector there.
If you binarize that vector, now you've only got a zero or a one for each dimension. So now you're basically in one of four quadrants, and every vector has to be in one of four quadrants.
So this is the problem with binary quantization: if you reduce the number of dimensions, it doesn't work. But for 1,000 dimensions it works really well, because you can have 2 to the power of 1,000 possible unique vectors. That exponential blows up much faster than the total number of vectors that most people have. So this is why binary quantization is working quite well.
You might find that binary quantization doesn't work, or maybe the distribution of your data is not good. Because what you do in binary quantization is you ask: is the dimension positive or negative? If it's positive, you turn it into a one bit. If it's negative, you turn it into a zero bit. If the data is not roughly normally distributed, let's say you don't have an equal split between negatives and positives and everything is positive, then all of the vectors will just be ones, and binary quantization doesn't work. So you have to be careful about the distribution of your data before you apply binary quantization.
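One hedged way to sanity-check that before flipping the switch is to look at the sign balance per dimension; values far from 0.5 mean a dimension will collapse to the same bit for almost every vector (a hypothetical check, assuming NumPy):

```python
import numpy as np

def sign_balance(vectors: np.ndarray) -> np.ndarray:
    """Fraction of vectors that are positive in each dimension."""
    return (vectors > 0).mean(axis=0)

rng = np.random.default_rng(1)
vectors = rng.normal(loc=0.3, scale=1.0, size=(10_000, 1_000))  # toy, slightly shifted data
balance = sign_balance(vectors)
skewed = int(np.sum((balance < 0.05) | (balance > 0.95)))
print(f"{skewed} of {vectors.shape[1]} dimensions are almost always the same bit")
```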
If you're finding binary quantization is not working, then in Weaviate we've actually implemented a lot of different quantization techniques.
So you could try something
like product quantization.
Product quantization is slightly more complicated than binary quantization because it takes the vector and cuts it up, and then it says, okay, I'm going to cluster this segment across all of my vectors together and find some centroid that represents it. And it does this for every single segment. Now, anytime a new vector comes in, you take the first segment and say, okay, instead of representing it with these numbers, I'm just going to assign it the ID of the closest cluster for this segment, and I do that for every segment. So now I only need to remember what those cluster IDs are, and that's it.
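A compact sketch of that idea, assuming NumPy and scikit-learn's KMeans; production implementations (FAISS, Weaviate, and others) add many refinements on top of this:

```python
import numpy as np
from sklearn.cluster import KMeans

def train_pq(vectors, num_segments=8, num_centroids=256):
    """Cut every vector into segments and learn a codebook (centroids) per segment."""
    segments = np.split(vectors, num_segments, axis=1)
    return [KMeans(n_clusters=num_centroids, n_init=10).fit(seg) for seg in segments]

def encode_pq(vector, codebooks):
    """Replace each segment with the ID of its nearest centroid."""
    segments = np.split(vector, len(codebooks))
    return np.array([cb.predict(seg.reshape(1, -1))[0]
                     for seg, cb in zip(segments, codebooks)], dtype=np.uint8)

rng = np.random.default_rng(2)
train = rng.normal(size=(5_000, 128))          # toy 128-dimensional embeddings
codebooks = train_pq(train)
code = encode_pq(rng.normal(size=128), codebooks)
print(code)   # 8 one-byte IDs instead of 128 float32 values (512 bytes -> 8 bytes)
```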
So we started with product quantization.
Then we went to binary quantization.
Now we're also looking at a couple
of other quantization techniques.
Scalar quantization is another one, where you simply take each 32-bit number in the vector and reduce it down to eight bits.
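A minimal sketch of that mapping, assuming a per-dimension min/max scaling from float32 down to uint8:

```python
import numpy as np

def scalar_quantize(vectors: np.ndarray):
    """Map each dimension onto 256 evenly spaced buckets (8 bits)."""
    lo, hi = vectors.min(axis=0), vectors.max(axis=0)
    scale = (hi - lo) / 255.0
    return np.round((vectors - lo) / scale).astype(np.uint8), lo, scale

def dequantize(codes, lo, scale):
    """Approximate reconstruction of the original floats."""
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(3)
vectors = rng.normal(size=(1_000, 768)).astype(np.float32)
codes, lo, scale = scalar_quantize(vectors)
error = np.abs(dequantize(codes, lo, scale) - vectors).mean()
print(f"uint8 codes use a quarter of the memory, mean reconstruction error ≈ {error:.4f}")
```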
So there are a lot of different things that people are trying out. The reason why I said people are not using it as much as they should is that it's almost like a superpower, right? You start to think that using vector databases will cost you thousands of dollars, and then you apply quantization. Not only does it cut down the memory cost, but depending on how many objects you have, you can also run brute force search, right?
If you've got 10,000 or 50,000 objects, you use binary quantization and you can run in-memory search. You don't need to build approximate nearest neighbor indexes or anything like that.
So this is another feature that we implemented where, given a certain number of objects, you can just vectorize them, then we binary quantize them, and then we do what we call flat indexing, where we don't even build up the HNSW graph, we just do brute force comparisons. Because they're binary vectors, you can optimize the distance comparison between them. It's quite efficient, and you can get away with just brute force searching through them up to a certain number of objects.
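What that flat, brute-force path can look like in code: pack the binary vectors into bytes, then compare with XOR and a popcount; a hedged sketch assuming NumPy, not Weaviate's actual implementation:

```python
import numpy as np

def pack(binary_vectors: np.ndarray) -> np.ndarray:
    """Pack 0/1 vectors into bytes: 1,024 dims become 128 bytes instead of 4,096."""
    return np.packbits(binary_vectors, axis=-1)

def brute_force_search(packed_docs, packed_query, k=10):
    """XOR + popcount over every document: no ANN index needed at small scale."""
    xor = np.bitwise_xor(packed_docs, packed_query)
    hamming = np.unpackbits(xor, axis=-1).sum(axis=1)   # differing bits per document
    return np.argsort(hamming)[:k]

rng = np.random.default_rng(4)
docs = (rng.normal(size=(50_000, 1_024)) > 0).astype(np.uint8)
query = (rng.normal(size=(1, 1_024)) > 0).astype(np.uint8)
print(brute_force_search(pack(docs), pack(query)))   # IDs of the 10 nearest objects
```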
Nicolay Gerold: I think quantization builds on the assumption that there is still noise in the embeddings. And it's pretty similar to the idea behind Matryoshka embeddings, that I can reduce the dimensionality. Which technique do you think has more potential, quantization or Matryoshka embeddings?
Zain Hasan: Yeah, I would say that Matryoshka embeddings are kind of a type of quantization as well. This is an interesting question, because in the case of Matryoshka embeddings, what people are doing is basically changing the loss function so that the distribution of information in the learned vectors is not all the same.
So when you use an off-the-shelf embedding model, you can't really say, okay, dimension 57 is really important, so make sure you never quantize that, right?
Pretty much every dimension
is equally important.
And when you look at the variance of the numbers in the dimensions, they're all pretty much the same, which is evidence that all the dimensions are capturing an equal amount of information. When you apply binary quantization or PQ on top of those, you take that into account, right? You don't squish one dimension more than another, because they're all equal.
MRL, Matryoshka representation learning, changes this, right? It says the first X dimensions are much more important, the next chunk of dimensions is less important, and you keep on going, so by the time you get to the end, the dimensions are not that important. It's almost like saying the first hundred or five hundred dimensions will give you all the information you need, and then the next 500 will give you a little bit of detail around those main dimensions.
I think this is quite a unique approach. I spoke to the first author of that paper, and he said the motivation behind it was that he didn't have enough compute when he was running experiments. He wanted to train four models, so he nested the loss functions for those four models together, and out came these MRL vectors, where you've got the first 500 dimensions, then up to 1,000, then 1,500, and so on. Because you've nested the loss, you've got different amounts of information being captured by every dimension.
And it's interesting now that this technique is being used so widely. But going back to your original question of which one I think is more promising: probably normal quantization, like binary quantization and things like this, because it doesn't require the initial embedding to be trained in any particular way, right? I can just go in and apply PQ, or I can go in and apply BQ, and it will be okay.
Now, having said that, Cohere did some work, and their new embedding models are quantization aware, right? Their loss function takes binary quantization, and also scalar quantization, into account. And they showed that if you train the embedding models to know about quantization, you can actually keep a lot more of the accuracy.
So this is going more towards
what MRL was doing, right?
They're letting the embedding models know what to optimize for by changing the loss function.
Yeah, a lot of people right now are just using these PQ and BQ techniques without the loss function even knowing about it, right? And this is why I think it's a little bit more doable: in Weaviate you can put your data in and say, okay, just enable BQ or enable PQ.
If the embedding model was trained
with the quantization technique
in mind, you'll lose less recall.
But for MRL, there's no way you can just enable MRL, right? If I have a vector here that has no idea of MRL, I can't just say, you know, now it's MRL. You kind of have to fine-tune the model with an MRL loss function to turn those vectors into Matryoshka vectors, which I think is a bigger barrier for people to use MRL, as opposed to just configuring a vector database to use BQ or PQ. That, I think, is a lot easier.
If OpenAI is now providing MRL vectors, then of course you don't have to worry about it, and you can just start truncating them. I think a lot of people are using those and discovering the power of that, right?
The other cool thing that you mentioned before this was multi vector support. I'm really interested in this idea of how MRL can take advantage of multi vector support to do adaptive retrieval, right? That's another really fascinating thing happening in the field, where people search on smaller vectors that capture the most important information, and then if you want more accuracy, you take a bigger piece of the vector and redo the search, take an even bigger piece and redo the search, depending on how much time you're willing to invest in that query versus how much accuracy you actually want out of it. So I think that's another really interesting topic that is underexplored right now, but a lot of people are moving towards it.
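A rough sketch of that adaptive retrieval pattern, assuming the vectors came from an MRL-trained model so their prefixes are meaningful on their own; the dimensions and shortlist sizes here are hypothetical:

```python
import numpy as np

def cosine_scores(query: np.ndarray, docs: np.ndarray) -> np.ndarray:
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return d @ q

def adaptive_search(query, docs, stages=(128, 512, 1536), shortlist=(1_000, 100, 10)):
    """Search cheaply with a short prefix, then rescore survivors with longer prefixes."""
    candidates = np.arange(len(docs))
    for dims, keep in zip(stages, shortlist):
        scores = cosine_scores(query[:dims], docs[candidates, :dims])
        candidates = candidates[np.argsort(-scores)[:keep]]
    return candidates

rng = np.random.default_rng(5)
docs = rng.normal(size=(100_000, 1536))   # only meaningful if these are MRL vectors
query = rng.normal(size=(1536,))
print(adaptive_search(query, docs))       # final top-10 candidates
```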
Nicolay Gerold: Yes, and especially looking at embeddings, I'm still sometimes torn between the vector store, a feature store, and a document store. Like, how do I map each of those models onto a vector database? How would you place a vector database in the general database sphere, and how does it delineate from feature stores and from document stores?
Zain Hasan: Yeah, so this is something we were initially talking about in the company: a vector search engine versus a vector database, and then an AI native vector database. If you think about an AI native vector database, essentially what we mean is that you've got the core database component, but all of the AI-related stuff, like the inference and the embedding model that's required to generate the vectors, is also connected to the database, right? The quantization models are also connected to the database. The language model that you might eventually be using to generate off of what you retrieve can also be easily hooked in and integrated into the database, right?
So if you think about this idea of an AI native vector database, this is what we mean by it: everything that you need to build the AI app should either have an easy integration into the database, or should live in the database itself, or be connected to the database.
If you think about it from that perspective, I think this is a core piece of infrastructure that touches all of the other parts of your AI native tech stack.
Whereas a feature store, I think, is probably a long-term store for your vector embeddings: you have a data engineering pipeline, you're producing those features, and you don't really want to perform vector search over them. If you store them in the vector database, then you're going to be paying for that. There are ways to get around it, but you don't want to use a vector database just to store your data and never search over it, right? That's what a feature store is for.
If you want to take your data, index it, store it in a vector database, and then perform a large number of queries on it, actively using that data, actively updating it, modifying it as users interact with your platform and changing the representations, that's what I think a vector database is for.
Nicolay Gerold: Yep.
Are there any plans to actually
support like pruning of the database,
like removing unused vectors or
even integrating something like
foreign data wrappers, so that I can hook into cloud storage where older embedding vectors are stored?
Zain Hasan: Yeah. So the way we currently get rid of this is something that we're actually changing now. When you delete a vector, because a vector database has all the capabilities of a normal database, you can do all of your CRUD operations. But if you have an HNSW index, it's not as easy as just making that vector disappear, because it's a proximity graph, so the vector is connected to a lot of the data points in its proximity, right?
So getting rid of this vector means that you have to update the dependencies of all these other nodes on this node. You have to connect them to each other, because now this node is gone. That algorithm is already in place, right?
So when you delete a vector, removing it from the graph is already there. I'm not sure what you mean by pruning; is that what you meant?
Exactly.
So that's always been there.
That's like a core functionality
of vector databases.
We are updating it and
making it more efficient now.
Before, we used to just reconnect. So let's say you had three nodes, one, two, and three. If you wanted to delete node two, we would just take the connections from node one and node three and connect them to each other. And now, because there are no connections to node two, as you're spider-webbing, jumping from node to node, there's no way for you to reach it; there's only a connection between node one and node three. So node two might as well not exist. It's still there physically, but it's not accessible. And then when we have some downtime, we just remove all those unreferenced nodes.
But now we're actually making that
algorithm a lot more efficient.
Nicolay Gerold: Nice.
And what is really interesting to me is the focus on recall versus precision. What are actually the levers I could pull in a vector database to also bring precision into the picture, beyond the similarity threshold?
Zain Hasan: Yeah, so I think it depends on your application, right? There are a lot of different levers. We've talked about the HNSW index: you can modify its parameters, and that will give you some control over precision. Outside of that, there are a lot of other things that people are doing now, and this is because of the popularity of RAG. People are realizing that the R part is just as important, if not more important, than the generation part of their RAG stack, because if your R part sucks, then the generation will be full of hallucinations. So people are coming up with all these techniques to do retrieval.
Initially, the conversation was mainly around chunking, and then people realized, what if we have huge documents and there really isn't a good way to do semantic-level chunking over them, how do we capture those? And then people realized, okay, you have to use classical keyword search in those cases.
If you have gigantic textbooks, there really is no way to do single-vector search. You either have to do multi vector support or fall back onto things like BM25, where you're doing keyword searches through them. And then people said, okay, let's combine vector search with keyword search. We can use hybrid search, and hybrid search gives you really good improvements in recall, right?
So pretty much all of the use cases that we're seeing take advantage of hybrid search, because there you have this fine-tuning parameter to say how much vector search you want and how much keyword search you want, right? I think that's probably the biggest boost.
Another boost is using re-ranking, right? A lot of these models in vector search are relative, in the sense that for the distances you measure, in one model a cosine similarity of 0.9 might be considered very high, but in another embedding model that was trained differently, a cosine similarity of 0.6 might be considered super high, right?
And if you're looking at dot products, there is no bound at all, right? So the threshold of similarity depends entirely on the distribution of your data. You have to look at the average distance between vectors and say, okay, anything above this or below this will be considered similar.
So using these more powerful models, like re-ranking models, you can really increase precision as well.
So this is another thing that we've recently been adding in Weaviate. Because we want all of these different tools to be accessible to the user of the database, we've been adding modules that allow you to do hybrid search, and not only hybrid search, but, now that you've gotten back 25 relevant things, do a re-ranking on those 25 things. So you can call a re-ranking step on top of the hybrid search call, and it's going to go through your 25 objects, compute individual scores for the query against each of them one by one, and then re-rank those for you.
So I think that's another cheat code that a lot of people are not using. And it's a very simple add-on to your current workflow: instead of just doing vector search, you can do vector search with re-ranking and see an immediate boost in precision.
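The pattern is: overfetch with hybrid search, then spend the expensive model only on the shortlist. Sketched here without any particular database client, assuming you already have BM25 and vector scores per document and a sentence-transformers cross-encoder (the model name is just an example):

```python
import numpy as np
from sentence_transformers import CrossEncoder

def hybrid_scores(bm25: np.ndarray, vector: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend normalized keyword and vector scores; alpha=1.0 would be pure vector search."""
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)
    return alpha * norm(vector) + (1 - alpha) * norm(bm25)

def search_with_rerank(query, docs, bm25, vector, overfetch=25, k=5):
    # Stage 1: cheap hybrid retrieval over the whole collection.
    candidates = np.argsort(-hybrid_scores(bm25, vector))[:overfetch]
    # Stage 2: an expensive cross-encoder scores only the 25 candidates.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, docs[i]) for i in candidates])
    return [int(candidates[i]) for i in np.argsort(-scores)[:k]]
```

In Weaviate, the same two-stage flow is exposed through its hybrid search and reranker modules, so you would not hand-roll the score fusion yourself.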
Nicolay Gerold: Yes. Do you already support searching across vectors in a multi vector system, so that I can search with one vector in another vector's index? If I, for example, have two vectors in a multi vector setup, one for the entire document and one for a summary, can I use the summary vector to search in the document vector index?
Zain Hasan: That's a good question, but before I answer it, the assumption is that the machine learning model used to generate the summary vector and the chunk vectors is exactly the same. So currently we just added multi vector support. I wasn't sure whether it's possible to search that way, but actually, you can, now that I'm thinking about it.
So let's say you configure the database, right? You start it up and you configure two named vectors: you've got summary vectors, and you've got chunks from the original document. You've got those two indices. Now it's going to go in and build up one HNSW graph of the summaries and one of the chunks you've extracted from the document.
And now let's say you come in with a summary, right? If you go in with the summary and you go to the first named vector, the summary index, you'll get back other summaries. But you can take that summary, vectorize it with the same module, because these two indices are vectorized using the same embedding model, and search the chunk index with it, and this allows you to do the summary-to-chunk search.
And the other way around, if you have chunks and you want to say, okay, what summary might this chunk have come from, you just take the chunk, pass it in as a query to that first named vector, and you'll get back summaries.
Initially, I was confused because I wasn't sure whether that was a database-level thing. On the database side, the only thing you need to have is multi vector support. The query itself can be routed to one named vector or the other.
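A client-agnostic sketch of that cross-index query; `embed` here is a hypothetical stand-in for the single shared embedding model behind both named vectors:

```python
import numpy as np

def embed(texts):
    """Hypothetical stand-in for the shared embedding model behind both indices."""
    rng = np.random.default_rng(42)
    return rng.normal(size=(len(texts), 768))

# Two indices built with the *same* model: one over summaries, one over chunks.
summary_index = embed(["summary of report A", "summary of report B"])
chunk_index = embed(["chunk 1 of A", "chunk 2 of A", "chunk 1 of B"])

# Cross-index query: take a summary's vector and score it against the chunk index.
query_vec = summary_index[0]
scores = (chunk_index @ query_vec) / (
    np.linalg.norm(chunk_index, axis=1) * np.linalg.norm(query_vec))
print(np.argsort(-scores))   # chunk IDs ranked by similarity to that summary
```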
Nicolay Gerold: Yeah. And this opens up way more recommendation system use cases, but also really fun RAG use cases, like the adaptive one you mentioned before, where you can actually escalate to different representations. Nice.
How do you see vector databases changing the way we store, process, and interact with data over the next three to five years, like horizontally?
Zain Hasan: Yeah, vector databases are pretty much the commercialization and productization of representation learning, right? Initially, when I started at Weaviate, I didn't know what vector databases were. Then I started learning, and I realized that vector databases are just representation learning scaled up and made available, where any company can use representation learning to do search and retrieval.
And this is super powerful, because machine learning models in PyTorch, or from papers, are very inaccessible to a startup or even to companies that don't have machine learning experience. Where I feel vector databases come in and save the day is that they abstract away all of these different things, right? They can make smart default choices about the HNSW index parameters, what type of index you want to build, and what machine learning model you want to use initially.
You can just go ahead with a default one and see how well it works. Of course, getting your feet wet, fine-tuning the embedding model, and doing all of that will improve the accuracy, but it's about going from zero to one, from someone not using vector search to using vector search. I think the way vector databases will change the landscape is that right now, when people want to search, they think in a very computer-like manner about how the search needs to happen, right?
We look for words, or we look for matching patterns or something like that. Vector databases allow you to take all the advances that are happening in machine learning and now simply flip a switch and use them for your application.
I think this is a really exciting thing.
You mentioned recommendation systems. I can go from searching for recommendations over product descriptions to now, in a multi vector world, combining that with recommendations over the images and how those products look. I don't need to be a machine learning expert to do that now.
In fact, we have an entire section of our team working on educating JavaScript developers and web developers on how to use vector databases. Can you imagine? You don't need to have any machine learning expertise to use it, right? You don't even need to learn Python now to access all this machine learning knowledge.
And I think this is really the power of vector databases, right? It lives outside the machine learning world, where everybody can now access really smart, human-level search for their applications. In five years, I have no idea where the field is going, but in the next year or two, seeing people take vector search and use it for their web apps, I think that's going to be very exciting.
Nicolay Gerold: What are maybe some of the most interesting, but also weirdest, use cases you have seen implemented on top of Weaviate?
Zain Hasan: Yeah, that's a good question.
So there is one particular application where a person I spoke with was looking at palm images, because palmistry is an art and people are really interested in it. They were trying to embed images of people's palms, and to see if they could, given your palm, situate it among other people's palms and, if they zoom out, see some mosaic of, oh, these palms end up rich, or these palms end up living a long life. If you're asking about the weirdest thing, this is, I think, probably the most interesting and weirdest application. It goes back to discovering high-level distributions, like semantics, in data where people think there's something there, so let's embed it and see if there is actually some structure in the way our palms look.
Nicolay Gerold: What would be a use case you would love to see implemented on top of Weaviate?
Zain Hasan: Yeah.
I think there's a lot of potential with recommender systems. A lot of people are going really heavily into RAG, and it's understandable because it's a very exciting space, but I think recommender systems can definitely be revolutionized by this as well. If you combine recommender systems with multimodality, I think that completely changes how recommendations can be made, right?
Like if you think about how
recommendations are made right
now, typically, you're looking
at product descriptions and how
they interact with products that
you've bought or interacted with.
So if the description is similar, then
you'll be recommended those things, right?
Tomorrow, what might happen, and some companies are already doing this, Meta has published on this with their Facebook Marketplace, and Amazon has already published on this too, is multimodal recommender systems, where they generate vectors that are combinations of images and descriptions, text and images, and then they tune these models and make recommendations based not just on how a product is described, but also on the way the product looks.
And this allows you to recommend within a category much more accurately. Right now, if you search on Amazon or Facebook Marketplace for a couch with a very specific description, you might get other couches, but ones that don't look the same.
But if you now have image as a modality to recommend off of, you can differentiate between different types of couches, right? The description might be the same, but the way they look might be different. So now you can more accurately identify what a user likes and doesn't like, because you have another sense, the sense of sight, to do that with.
And if you extrapolate that into the future, there's a company that spun out of Google called Osmo that's working on digitizing smell as well, right? This is not publicly available right now, but they've published some papers on it.
If you can embed smell, and you can already embed video, you can take all of these modalities. A lot of these companies already have their products in these modalities, right? Their marketing teams know what a Coke sounds like when you pour it, or what a burger sounds like when somebody eats it, because those marketing assets are already there and they're using them in their advertisements.
Now, maybe like two, three years
into the future, you could search
and say, This is the perfect burger.
This is the crunchiness of the burger.
This is the perfect perfume because
I know it smells exactly like
this old kind of memory from my
childhood or something like that.
Right?
So now you've got recommendations on
those other sensory inputs, which is
kind of impossible to imagine right now.
I think that's what vector search can enable in a field. There are some companies that we're talking to and working with right now that are taking advantage of this multimodal recommender system stuff. I think that will only improve, and three or four years from now, we'll be amazed by the recommendations and what senses they're coming from.
Nicolay Gerold: Yeah, and especially in recommendation systems, you can extend it to the user side as well. One use case we are exploring at the moment is basically skincare: you have images of the users' conditions, you have images of the creams and products, you have the descriptions of the products, you have the ingredients, you have the description of the user's condition, and also the description of the user. Through that, you can do so many different interesting things in terms of what you're actually matching on. And also the reverse: if I can't find anything that matches, that already points me to a potentially interesting anomaly, and I might even hint that you should go to a doctor and actually talk to them.
Zain Hasan: For sure. One other interesting thing that I was thinking about for this is that a lot of these big corporations want to know consumer buying patterns, right? If you're buying product X versus product Y, why did you buy X and not Y, right?
If you think about this multimodal search a little bit more, about which senses a person is using to buy shoes versus which senses they're using to buy a soft drink, this is completely academic right now, but maybe in the future, if you have data around which products were purchased and which modality was more important for that purchase, you can do some testing to see, okay, when we do soft drink recommendations based on how they look or how they sound, more people tend to convert and buy based on those recommendations, versus if we just do it based on descriptions or nutritional facts, people don't buy it.
So now you can even understand which products are bought using which senses. I buy burgers because of the way they taste, not because of how they're described, right? But clothes I buy because of the way they look; I don't really taste the clothes, so that sense is not useful. So I think you can break down user intent across these modalities down to a science, which is kind of scary, but also interesting. People are already used to the fact that if they look at a YouTube ad ten seconds longer than they do on average, all of their ads change into that type of ad. Now you have to be wary of how you're using your other senses as well, because if somebody can capture that sense and embed it into a vector, that's potentially a way for an item to be recommended to you.
Nicolay Gerold: Yeah, we finally made it. The tech geeks can actually quantify psychology.
Zain Hasan: Exactly.
Nicolay Gerold: What is next for Weaviate? What can you already tease a little bit?
Zain Hasan: Yeah.
So one thing that we're working on right now, actually, we've got a release coming out where we're going to change how no-downtime upgrades work, so we're releasing Raft. By the time this comes out, it should already be live, if there aren't any big surprises, but that's a big thing.
I think the multi vector support was rolled out in the last update, and people are now starting to use it. And then the quantization support, right? At the beginning of this year, we had one way to quantize, and within four or five months we're already supporting four different quantization techniques.
So we're going really heavy into making vector search affordable so that it's not even a thought: if you want to do search, you're going to be using vector search. We're investing a lot into that as well.
I think those are the main frontiers
on the vector database technology
that people should look out for.
The other thing that we're doing, because we're bringing everything that's AI-related closer to the database, is making it really easy for people to go from zero to one, right? If you're thinking about using a vector database, you don't need to learn how embedding models work, you don't need to learn about all of that. You should just be able to write a few lines of code to try this thing out, build a POC, and then we help them scale it.
So I think those are the main things that we're working on.
Nicolay Gerold: What technology
are you most excited about outside
of vector databases and LLMs?
Zain Hasan: Yeah, this is a good question, because literally the only things I think about are search and generation, like LLMs. But okay, yeah. My background originally is in biomedical engineering.
Brain computer interfaces are a technology
that I am super, super excited about.
I don't know a lot about gene editing and CRISPR.
If I knew more, I'd probably be
more excited about that field.
But I think brain computer
interfaces are very, very exciting.
There was a paper that came out where, just as we have language foundation models, they built an EEG foundation model. They took the electrical signals from your brain and trained a foundation model to understand those signals.
And I think if you take software like that, foundation models that can understand the electrical activations in your brain, and you combine it with hardware, whether it's non-invasive, there are a lot of non-invasive headbands trying to measure EEG signals, or of course Elon's really invasive device that goes into your brain, and if you take representation learning techniques that allow you to decode brain signals, that I think is super exciting.
That is kind of related to vector databases, but I think it's sufficiently different, because representation learning is in everything. I don't think you can get away from that, but BCIs are super exciting.
Nicolay Gerold: And there's also an interesting development: there was a DeepMind paper on predicting weather with a graph structure overlaid over the longitude and latitude lines, and they also used an autoregressive model to predict the changes in state. I think this is a principle you could apply to the brain in a really interesting way, if you have the data to actually understand it. And this is also a form of embedding, because you're mapping onto a smaller structure, which gives you a better feature representation you can work with.
If people want to start building
the stuff we just talked about,
where would you point them to?
Zain Hasan: Yeah. So first, to get started, you can go to weaviate.io and have a look at what we're building. There's a lot of educational material, there are blogs, and there's the Weaviate podcast. Everything that we do at Weaviate is open source. If you want to get started with RAG, I would suggest people go to Verba. It's an open source GitHub project that we're building, maintaining, and upgrading at Weaviate.
It's probably the easiest way to go from having no RAG application to having a RAG application POC within a day. You can fork Verba and change a lot of the things that are modifiable: you can change the embedding model, point it to a GitHub repository, and embed all of your data. It takes care of a lot of things for you, it automatically chunks, and it's almost like a production-ready app that is customizable to your data.
If you're interested in that, I would get started with Verba. If you're interested in learning about multimodality or in working with vector databases, we also put out a lot of educational content. We've got a couple of courses on LinkedIn that teach you about vector databases, and we also partnered with DeepLearning.AI to put out courses. So if you search for Weaviate DeepLearning.AI courses, that will also get you started.
And Weaviate Academy, I think, is a great resource. If you just want a 101 on vector databases, I would go there, and we're always adding new courses and new material. We just added a new module on multi vector support and how to use it for recommender systems, for movie recommendations.
So I think those are some of the resources. And for more interaction, if you want to reach out to me or to anybody at Weaviate, you can join the Slack community or the forum and ask questions. We're very active there as well.
Nicolay Gerold: So what can we take away? First of all, to maybe double down on floating point precision: floating point is the most common way to handle numbers within AI. Most model weights are stored in floating point precision. Ergo, the vectors, the embeddings we produce with AI models, through embedding models or different types of models like autoencoders, are usually in floating point precision as well.
Within floating point, you have a bunch of different options. The most common one is FP32, which is also called single precision, and it uses 32 bits of memory per number. It has very good accuracy, decent speed, and high memory consumption.
half of that, so 16 bits per number.
And it's much faster for calculations and
the loss of accuracy, especially in AI.
is really miniscule, so it's really small.
So FP16 nowadays in AI is really common
to use even when you're in the cloud
with bigger GPUs, just because it's,
the sacrifice in accuracy is so small.
And when it comes to vectors, you can do even more, like what we talked about with binary quantization: you basically turn each number into a zero or a one. This works for vectors because they're usually very high dimensional, so the binarization still splits the space in a lot of different directions, and you still keep a decent amount of accuracy. And it cuts the memory consumption by roughly 97 percent.
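The percentage follows directly from the bit widths; a quick check for a 1,000-dimensional vector:

```python
# Bytes per 1,000-dimensional vector at different precisions (illustrative).
dims = 1_000
fp32   = dims * 4     # 4,000 bytes
fp16   = dims * 2     # 2,000 bytes
int8   = dims * 1     # 1,000 bytes (scalar quantization)
binary = dims // 8    #   125 bytes (binary quantization)
print(f"binary cuts memory by {100 * (1 - binary / fp32):.1f}% versus float32")
```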
And that's it.
The other two types of quantization we talked about, scalar quantization and product quantization, are a little bit more complex. Scalar quantization is a middle ground, I would say, between binary and product quantization in its complexity. It usually uses 8 bits, so it keeps more precision, but it also saves less. It's basically a good trade-off: memory consumption goes down, speed goes up, but the accuracy also goes down a bit.
Whereas binary quantization really optimizes for speed and memory, with also a decent amount of accuracy lost. So you really want to pick what matters most to you.
Product quantization, on the other hand, does way more. It splits up vectors into chunks, groups similar chunks together, and then stores IDs instead of full numbers. This is especially good when binary quantization fails; then you probably want to grab something like product quantization. It's more complex, but it's really flexible, and you can cut your cost by a lot just by quantizing. Basically, you try to hit the amount of memory or the amount of speed you actually need, and then try to squeeze as much accuracy into it as possible.
And this really guides you to the main trade-off: cost versus quality. Quality is basically the accuracy you get, and the costs are mostly driven by memory or by speed.
When you're quantizing, memory consumption and storage of course go down. In terms of speed, you could use FP32 and reach the same speed as with binary quantization by just throwing more compute at it. So the cost-quality trade-off is the main one you have to consider.
And within that, you basically want to optimize for your use case and pick small trade-offs.
If you have really small data, you
probably can just use regular vectors.
If you have a large amount of
vectors, you actually have to
start thinking about quantization.
And then you basically have to think through, okay, depending on the use case and what I actually want to optimize for, which type of quantization should I opt for? Or, if accuracy really is the most important driver and you don't care about the cost, you don't have to quantize at all.
When you make your choice, always first determine what you're optimizing for, and then test against your needs. Try a few options.
Try, for example, binary quantization
and see whether you can reach
the desired level of accuracy.
And if it works, great, you
got binary quantization.
It's the cheapest method.
Your accuracy is good enough.
You can go with that.
If not, move to the other
methods and try them out.
An important part within that is to always watch your data closely. Check the spreads, check the distributions, and also check the distance results you're actually getting in your vector search, because this will tell you a lot. Binary quantization works best if you have a roughly normal distribution centered around zero, and in the real world we often don't really have that. So you might have weird spreads, which you actually have to monitor.
Also, watch your budget. Start cheap, and then you can still add quality over time where it matters. But when you want to add quality, you don't need to use FP32 or really high precision. You could also think about adding a re-ranker at this step, after the retrieval, and really overfetching with binary quantization.
In the end, it really depends on how you want to architect your system. What I think is a really interesting aspect is that quantization enables you to do a lot more by making storage so cheap that you can store more vectors. That's a pattern we see a lot in technology: when you make something cheaper or easier, we see a lot more of it.
And when vectors get really cheap, we can store multiple representations of the same data. For example, if I have a text, I might embed the full text, embed a summary, embed multiple vectors for each document and store them as a multi vector associated with one document, or embed a question the document is answering. Then I can run retrieval over the different vectors and search through those instead of just one. This allows me to do a lot of cool stuff, which is enabled simply through the really cheap storage.
And yeah, that's it for this week. We will continue with search next week, so stay tuned for that. We will probably move in a different direction, into knowledge graphs, to close out the season. If you like the episodes, let me know. Also, leave a like on YouTube, leave a review on Spotify or Apple. Otherwise, I will catch you next week, so see ya.