Inside Vector Database Quantization: Product, Binary, and Scalar | S2 E23

Nicolay Gerold: When you store
vectors, each number usually takes

up 32 bits within the vector.

So with a thousand numbers per vector and millions of vectors, the costs really explode.

A simple chatbot can cost thousands per month just to store and search through your embeddings in the database.

The fix is quantization, and you can think of quantization like image compression.

JPEGs look almost as good as raw
photos, but take up far less space.

And quantization does
the same for vectors.

The trade off is, you lose some accuracy.

But often, that loss doesn't matter or is negligible, just like the slight loss in quality in a JPEG.

And today we're back, continuing our series on search with Zain Hasan. Zain is a former ML engineer at the vector database Weaviate and now a senior AI and ML engineer at Together.

And we talk about the different types
of quantization, when to use them,

how to use them, and their trade offs.

Let's do it.

Zain Hasan: When people think
about vector databases, they think

about them with regards to ChatGPT.

So they think, oh, vector databases are a way to store my text data, chunk it, retrieve relevant pieces, and give them to ChatGPT.

But vector databases can handle text, image, audio, and video data just as well.

Right?

If you have a machine learning model
that can encode that data, then a

vector database doesn't care because
the vector of an image is pretty much

the same as the vector of the text.

All of the mathematical computations
for that vector will be the same.

So going from text to multimedia in vector databases is really simple, as long as you have a machine learning model that can encode the data, and there are a lot of open source ones.

And then the second thing
I think is quantization.

So a lot of people think that vector databases are this really expensive thing, like fine dining cuisine: you go to a restaurant and you have to pay a thousand dollars a pop or something like that.

A lot of people have this view in
the field that, oh, you're doing

vector search, it must cost a lot.

I don't think a lot of people
are using vector quantization.

Right.

If you're using the entire vector that comes out of the model, that can end up costing a lot, but oftentimes you don't need the entire thing. You can quantize it. And we can talk about this: you can cut down your cost by 80 or 90 percent just by quantizing, depending on how much accuracy you want, right?

So there's a lot of tuning of cost, speed, and memory footprint that I don't think a lot of people are utilizing right now.

It's still a new field, but I think those are the couple of things that a lot of people are surprised to hear.

Nicolay Gerold: Quantization-wise, what are the trade offs you're actually making?

Do you have a matrix you're looking
at, or a different set of metrics?

Zain Hasan: Yeah.

So basically, when you're evaluating or looking at how good the vector search component of your RAG stack is, there are two or three things that you're really looking at.

You're looking at how fast this thing is, so how many queries can you serve? You could think of that as throughput.

Another thing you're thinking about is the total memory footprint of your index. So if I've got a million objects in my vector database, how much in-memory cost does that amount to? And then you're also looking at the recall.

If you're passing in a query, are you
even getting back relevant things?

And so usually you're balancing those three things.

If you have an application where recall is the most important thing, then you might want to sacrifice latency and memory consumption to boost recall as high as possible.

Okay, so there's a lot of things that
allow you to kind of fine tune and

balance between those three things.

Vector quantization is one thing, right?

So vector quantization basically says: let's say you have a thousand dimensional vector.

Every single dimension of that vector is
a 32 bit floating point number, right?

So every dimension is four bytes.

And so if you have a thousand dimensions, that's 4,000 bytes per vector stored, and then that scales up.

If you have a million vectors, then you've got a lot of in-memory requirements, right?
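To make that memory math concrete, here is a small back-of-the-envelope sketch. The vector count, dimensionality, and precision are just the illustrative numbers from the conversation, not anything specific to one database:

```python
# Back-of-the-envelope memory footprint for full-precision vectors.
# Assumed numbers: 1,000 dimensions, 1,000,000 vectors, FP32 (4 bytes per dimension).
dims = 1_000
num_vectors = 1_000_000
bytes_per_dim = 4  # 32-bit float

bytes_per_vector = dims * bytes_per_dim        # 4,000 bytes per vector
total_bytes = bytes_per_vector * num_vectors   # 4,000,000,000 bytes

print(f"{bytes_per_vector} bytes per vector")
print(f"{total_bytes / 1e9:.1f} GB for {num_vectors:,} vectors")  # ~4.0 GB in memory
```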

Vector quantization basically asks:

Are you willing to let go of
some of those bits in order to

reduce the memory footprint?

And so now, if you can go from four bytes to four bits, or from four bytes to one bit, there's a whole spectrum of information that you can get rid of. You're reducing the memory footprint, but at the cost of recall, because now your vectors are becoming fuzzy. You don't know exactly where they are; you're using less memory to capture the location of every vector.

And so people have found different techniques: what is the right amount of information to get rid of? How do you get rid of it? What does the distribution of your data need to look like before you apply vector quantization, so that it still works afterwards?

And so there's a whole school of methods: product quantization, binary quantization, all of these different methods that people are now working with.

So you mentioned Mixedbread.

Cohere was also working on this.

I think binary quantization is something
that people are really excited about

now because it's kind of amazing to
think that you could have a thousand

dimensional vector and then you could
only keep one bit per dimension.

So it's almost like saying: let's say there was a party, and you need to describe to me the location of that party. Instead of giving me the full address, you just tell me, every time I come to a street, whether to go left or right, and that's all. You tell me consecutively, one after the other: left, left, right, right, left, right, left, right.

Like you could do this and you could
still capture vector distances pretty

accurately which is pretty amazing, right?

So I think people are correctly
excited about binary quantization.

Nicolay Gerold: What are the other levers you could pull to basically move along the different metrics you just mentioned?

Zain Hasan: Yeah.

So think about a full vector: it's 32 bits per dimension, right? So you've got all of that accuracy to capture exactly how far to move along the axis of each dimension, right?

So if you have a 2D vector, you've got
32 bits to represent how far along the X

axis and how far along the Y axis to move.

So you can exactly locate yourself in a 2D or 3D space, like a Cartesian coordinate system; you can locate the vector exactly.

If you binarize that vector,
now you've only got a zero

or a one for both dimensions.

So now you're basically in one of four quadrants, and every vector has to be in one of four quadrants. This is the problem with binary quantization: if you have a small number of dimensions, it doesn't work.

But for 1,000 dimensions, it works really well, because you can have 2 to the power of 1,000 possible unique vectors. And that exponential blows up much faster than the total number of vectors that a lot of people have.

So this is why binary quantization
is working quite well.

But maybe you find that binary quantization doesn't work, or the distribution of your data is not good, right?

Because what you do in binary
quantization is you say, is the

dimension positive or negative?

If it's positive, you
turn it into a one bit.

If it's negative, you
turn it into a zero bit.

If the data is not normally distributed, let's say you don't have an equal distribution between negatives and positives and everything is positive, then all of the vectors will just be ones.

So then binary quantization doesn't work.

So you have to be careful about the
distribution of your data before

you apply binary quantization.
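As a rough illustration of the sign-based binarization described here (a minimal NumPy sketch of the idea, not Weaviate's internal implementation):

```python
import numpy as np

def binary_quantize(vectors: np.ndarray) -> np.ndarray:
    """Turn each dimension into 1 if positive, 0 otherwise (sign-based binary quantization)."""
    return (vectors > 0).astype(np.uint8)

# Example: three 8-dimensional vectors (dimensions chosen only for readability).
vecs = np.random.randn(3, 8).astype(np.float32)
print(binary_quantize(vecs))   # e.g. [[1 0 1 1 0 0 1 0] ...]

# If the data is skewed so that every value is positive, every vector collapses
# to all ones and the quantized vectors become useless -- hence the distribution check.
skewed = np.abs(vecs)
print(binary_quantize(skewed))  # all ones
```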

If you're finding that binary quantization is not working, then in Weaviate we've actually implemented a lot of different quantization techniques.

So you could try something
like product quantization.

Product quantization is slightly more
complicated than binary quantization,

because it takes the vector.

It cuts it up, and then it says, okay,
I'm going to cluster this segment of all

of my vectors together, and I'm going to
find some centroid that represents this.

And then it does this
for every single segment.

And now, anytime a new vector comes in, you take the first segment and say, okay, instead of representing it with these numbers, I'm just going to assign it the ID of the closest cluster for this segment. And I've got so many segments, so now I only need to remember what those cluster IDs are. And that's all, right?
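Here is a minimal sketch of that segment-and-cluster idea. The segment count and codebook size are assumptions for illustration, and real product quantization implementations are far more optimized than this:

```python
import numpy as np
from sklearn.cluster import KMeans

def train_pq(vectors, n_segments=8, n_centroids=256):
    """Train one k-means codebook per vector segment."""
    seg_len = vectors.shape[1] // n_segments
    codebooks = []
    for s in range(n_segments):
        seg = vectors[:, s * seg_len:(s + 1) * seg_len]
        codebooks.append(KMeans(n_clusters=n_centroids, n_init=10).fit(seg))
    return codebooks

def encode_pq(vector, codebooks):
    """Replace each segment with the ID of its nearest centroid."""
    seg_len = vector.shape[0] // len(codebooks)
    return np.array([
        cb.predict(vector[s * seg_len:(s + 1) * seg_len][None, :])[0]
        for s, cb in enumerate(codebooks)
    ], dtype=np.uint8)  # 1 byte per segment instead of seg_len * 4 bytes

vectors = np.random.randn(10_000, 128).astype(np.float32)
codebooks = train_pq(vectors)
code = encode_pq(vectors[0], codebooks)  # e.g. array([ 17, 203, ...], dtype=uint8)
```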

So we started with product quantization.

Then we went to binary quantization.

Now we're also looking at a couple
of other quantization techniques.

Scalar quantization is another one, where you simply take each 32-bit dimension and reduce it down to eight bits.
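A minimal sketch of that 8-bit scalar quantization idea (illustrative only; real implementations typically estimate the value ranges per dimension from a training sample rather than the full dataset):

```python
import numpy as np

def scalar_quantize(vectors: np.ndarray):
    """Map each 32-bit float to an 8-bit integer using a per-dimension range."""
    lo = vectors.min(axis=0)
    hi = vectors.max(axis=0)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0  # guard against constant dimensions
    codes = np.round((vectors - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Approximate reconstruction of the original floats."""
    return codes.astype(np.float32) * scale + lo

vecs = np.random.randn(1_000, 128).astype(np.float32)
codes, lo, scale = scalar_quantize(vecs)   # 1 byte per dimension instead of 4
approx = dequantize(codes, lo, scale)      # close to vecs, with small rounding error
```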

So there's a lot of different things that people are trying out. But the reason why this reduces cost, and the reason why I said people are not using it as much as they should, is that it's almost like a superpower, right?

You start out thinking that using vector databases will cost you thousands of dollars.

And then you apply quantization.

And not only does it cut down the memory cost, but depending on how many objects you have, you can also run brute force search, right? If you've got 10,000 or 50,000 objects and you use binary quantization, you can run in-memory search.

You don't need to build
approximate nearest neighbors

indexes or anything like that.

So this is another feature that we implemented: given a certain number of objects, you can just vectorize them, then we binary quantize them, and then we do what we call flat indexing, where we don't even build up the HNSW graph, we just do brute force comparisons. And because they're binary vectors, you can optimize the distance comparison between them.

It's quite efficient, and you can get away with just brute force searching through them up to a certain number of objects.
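A rough sketch of what brute-force search over binary-quantized vectors can look like (illustrative NumPy with Hamming distance via XOR and popcount; not Weaviate's flat index, and the sizes are assumptions):

```python
import numpy as np

def pack_bits(vectors: np.ndarray) -> np.ndarray:
    """Binarize by sign and pack each vector into bytes (1 bit per dimension)."""
    return np.packbits((vectors > 0).astype(np.uint8), axis=1)

def hamming_search(query_bits, db_bits, k=10):
    """Brute-force k-nearest neighbors by Hamming distance over packed binary vectors."""
    xor = np.bitwise_xor(db_bits, query_bits)        # differing bits per vector
    dists = np.unpackbits(xor, axis=1).sum(axis=1)   # popcount per vector
    return np.argsort(dists)[:k]

db = np.random.randn(50_000, 1024).astype(np.float32)
db_bits = pack_bits(db)                               # 128 bytes per vector instead of 4,096
query_bits = pack_bits(np.random.randn(1, 1024).astype(np.float32))
top_ids = hamming_search(query_bits, db_bits, k=10)
```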

Nicolay Gerold: I think quantization builds on the assumption that there is still noise in the embeddings. And it's pretty similar to the idea behind Matryoshka embeddings, that I can reduce the dimensionality.

Which technique do you think has more potential: quantization or Matryoshka embeddings?

Zain Hasan: Yeah, I would say that Matryoshka embeddings are kind of like a type of quantization as well. This is an interesting question, because in the case of Matryoshka embeddings, what people are doing is basically changing the loss function so that the distribution of information in the learned vectors is not all the same.

So when you use an off-the-shelf embedding model, you can't really say, okay, dimension 57 is really important, so make sure you never quantize that, right?

Pretty much every dimension
is equally important.

And when you look at the variance of the numbers in the dimensions, they're all pretty much the same, which is evidence towards the fact that all the dimensions are capturing an equal amount of information.

And when you apply binary quantization or PQ on top of those, you take that into account, right? You don't squish one dimension more than another, because they're all equal.

MRL, Matryoshka representation learning, changes this, right? It says the first X dimensions are much more important, the next chunk of dimensions is less important, and you keep on going, so that by the time you get to the end, the dimensions are not that important.

So it's almost like saying the first hundred or 500 dimensions will give you all the information you need, and then the next 500 will give you a little bit of detail around those main dimensions.

I think this is quite a unique approach. I spoke to the first author of that paper, and he said that the motivation behind it was that he didn't have enough compute when he was running experiments.

So he wanted to train four models.

So he nested the loss functions for those four models together, and out came these MRL vectors, where you've got the first 500 dimensions, the next 1,000, then 1,500, and so on. Because you've nested the loss, you've got different amounts of information being captured by every dimension.

And it's interesting now that this technique is being used so widely. But going back to your original question of which one I think is more promising: I think probably normal quantization, like binary quantization and things like that, because it doesn't require the initial embedding to be trained in any particular way, right? I can just go in and apply PQ, or I can go in and apply BQ, and it will be okay.

Now, having said that, Cohere did some work, and their new embedding models are quantization aware, right? Their loss function takes binary quantization, and also scalar quantization, into account.

And they showed that if you train the embedding models to know about quantization, you can actually keep a lot more of the accuracy.

So this is going more towards
what MRL was doing, right?

They're letting the embedding
models know to optimize through

changing the loss function.

Yeah, a lot of people right now are just using these PQ and BQ techniques without the loss function even knowing about it, right?

And this is why I think it's a little bit more doable: in Weaviate you can put your data in and say, okay, just enable BQ or enable PQ.

If the embedding model was trained
with the quantization technique

in mind, you'll lose less recall.

But for MRL, there's no way you can just enable MRL, right? If I have a vector here that has no idea of MRL, I can't just say, you know, now it's MRL.

You kind of have to fine tune the model with an MRL loss function to turn those vectors into Matryoshka vectors, which I think is a bigger barrier to using MRL, as opposed to just configuring a vector database to use BQ or PQ. That, I think, is a lot easier.

If OpenAI is now providing MRL vectors, then of course you don't have to worry about it; you can just start chopping them down. I think a lot of people are using those, and they're discovering the power of that, right?

The other cool thing that you mentioned before this was the multi vector support. I'm really interested in this idea of how MRL can take advantage of multi vector support to do adaptive retrieval, right?

That's another really fascinating thing that's happening in the field: people are doing search on smaller vectors that capture the most important information, and then if you want more accuracy, you take a bigger piece of the vector and redo the search, then take an even bigger piece and redo the search, depending on how much time you're willing to invest in that query versus how much accuracy you actually want out of it.
So I think that's another really interesting topic that is underexplored right now, but that a lot of people are moving towards.

Nicolay Gerold: Yes, and especially looking at embeddings. With the vector store, I'm still sometimes torn between a feature store and a document store. Like, how do I map each of those models onto a vector database? How would you place a vector database in the general database sphere, and how does it delineate from feature stores and from document stores?

Zain Hasan: Yeah, so this is something that we were initially talking about in the company: a vector search engine versus a vector database, and then an AI native vector database. If you think about an AI native vector database, essentially what we mean is that you've got the core database component, but all of the AI related stuff, like the inference and the embedding model that's required to generate the vectors, is also connected to the database, right?

The quantization models are
also connected to the database.

The language model that you might
eventually be using to generate

off of what you retrieve can
also be easily hooked in and

integrated into the database, right?

So if you think about this idea of an AI native vector database, this is what we mean by it: everything that you need to build the AI app should either have an easy integration into the database, live in the database itself, or be connected to the database.

If you think about it from that perspective, I think this is a core piece of infrastructure that touches all of the other parts of your AI native tech stack.

Whereas a feature store, I think, is probably a long-term store for your vector embeddings: you have a data engineering pipeline, you're keeping those features around, and you don't really want to perform vector search over them. If you store them in the vector database, then you're going to be paying for that. There are ways to get around it, but you don't want to use a vector database just to store your data and never search over it, right? That's what a feature store is for.

If you want to take your data, index it, store it in a vector database, and perform a large number of queries on it, actively using that data, actively updating it, modifying it as users interact with your platform, and changing the representations, that's what I think a vector database is for.

Nicolay Gerold: Yep.

Are there any plans to actually support pruning of the database, like removing unused vectors, or even integrating something like foreign data wrappers, so that I can hook into cloud storage where older embedding vectors are stored?

Zain Hasan: Yeah.

So currently, the way we get rid of vectors is something that we're actually changing now. Because a vector database has all the capabilities of a normal database, you can do all of your CRUD operations. But when you delete a vector, if you have an HNSW index, it's not as easy as just making that vector disappear, because it's a proximity graph and that vector is connected to a bunch of others.

So it's connected to a lot of the data points in the proximity of that vector, right? Getting rid of this vector means that you have to update the dependencies between this node and all of those other nodes.

So you have to connect them to each other, because now this node is gone. That algorithm is already in place, right? So when you delete a vector, removing it from the graph is already handled.

I'm not sure what you mean by pruning; is that what you meant, or did you mean something else?

Exactly.

So that's always been there.

That's like a core functionality
of vector databases.

We are updating it and
making it more efficient now.

Before, we used to just reconnect. So let's say you had three nodes, one, two, and three, and you wanted to delete node two. We would just take the connections from node one and node three and connect them to each other. And now, because node two has no connections, as you're spider-webbing and jumping from node to node, there's no way for you to reach it; there's only a connection between node one and node three. So node two might as well not exist.

It's still there physically, but it's not accessible.

And then when we have some
downtime, we just remove all

those kind of unreferenced nodes.

But now we're actually making that
algorithm a lot more efficient.
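A toy sketch of that reconnect-and-tombstone idea, using a plain adjacency-list graph. This is only the concept as described here, not Weaviate's actual HNSW deletion code:

```python
class ProximityGraph:
    def __init__(self):
        self.edges = {}         # node_id -> set of neighbor ids
        self.tombstones = set()

    def add(self, node, neighbors=()):
        self.edges.setdefault(node, set())
        for n in neighbors:
            self.edges[node].add(n)
            self.edges.setdefault(n, set()).add(node)

    def delete(self, node):
        """Reconnect the deleted node's neighbors to each other, then tombstone it."""
        neighbors = self.edges.get(node, set())
        for a in neighbors:
            self.edges[a].discard(node)
            self.edges[a].update(neighbors - {a})  # keep the graph navigable
        self.edges[node] = set()
        self.tombstones.add(node)                  # physically removed later, during cleanup

    def cleanup(self):
        """Run during downtime: drop the unreferenced, tombstoned nodes."""
        for node in self.tombstones:
            self.edges.pop(node, None)
        self.tombstones.clear()

g = ProximityGraph()
g.add(1); g.add(2, neighbors=[1]); g.add(3, neighbors=[2])
g.delete(2)    # nodes 1 and 3 are now connected directly; node 2 is unreachable
g.cleanup()    # node 2 is physically removed
```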

Nicolay Gerold: Nice. And what is really interesting to me is the focus on recall versus precision. What are actually the levers I could pull in a vector database to also bring precision into the picture, beyond the similarity threshold?

Zain Hasan: Yeah, so I think it depends on your application, right? There are a lot of different levers. We've talked about the HNSW index: you can modify the parameters of the HNSW index, and that will give you some control over precision. Outside of that, there are a lot of other things that people are doing now, and this is because of the popularity of RAG.

People are realizing that the R part is just as important, if not more important, than the generation part of their RAG stack, because if your R part sucks, then the generation will be full of hallucinations.

So people are coming up with all
these techniques to do retrieval.

Initially, the conversation was mainly around chunking, and then people realized: what if we have huge documents and there really isn't a good way to do semantic-level chunking over them? How do we capture those?

And then people realized, okay, you have to use classical keyword search in those cases. If you have gigantic textbooks, there really is no way to do it with a single vector. You either have to do multi vector support or fall back onto things like BM25, where you're doing keyword searches through those documents. And then people said, okay, let's combine vector search with keyword search.

We can use hybrid search and
hybrid search gives you really

good improvements in recall, right?

So pretty much all of the use
cases that we're seeing take

advantage of hybrid search.

Because there you have this fine-tuning parameter to say how much vector search you want and how much keyword search you want, right?

I think that's probably the biggest boost.

Another boost is using re-ranking, right? A lot of these models in vector search are relative, in the sense that the distances you measure depend on the model: in one model, a cosine similarity of 0.9 might be considered very high, but in another embedding model that was trained differently, a cosine similarity of 0.6 might be considered super high, right?

So these distances, and if you're looking at dot products there is no bound at all, mean that the threshold of similarity depends entirely on the distribution of your data. So you have to look at the average distance between vectors and say, okay, anything above this or below this will be considered similar.

So by using these more powerful models, like re-ranking models, you can really increase precision as well.

So this is another thing that we've recently been adding in Weaviate. Because we want all of these different tools to be accessible to the user of the database, we've been adding modules that allow you to say: do hybrid search, and now that you've gotten back, say, 25 relevant things, do a re-ranking on those 25 things.

So you can add a re-ranking call on top of the hybrid search call, and that's going to go through your 25 objects, compute individual scores between the query and each of them one by one, and then re-rank those for you.

So I think that's another cheat code that a lot of people are not using, and it's a very simple add-on to your current workflow. Instead of just doing vector search, you can do vector search with re-ranking, and you see an immediate boost in precision.
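A rough, generic sketch of the hybrid-search-then-rerank pattern described here. This is not Weaviate's API: it assumes you already have normalized vector and BM25 scores per document plus the document texts, and it uses a cross-encoder from sentence-transformers for the re-ranking step:

```python
from sentence_transformers import CrossEncoder

def hybrid_scores(vector_scores, bm25_scores, alpha=0.5):
    """Blend vector and keyword scores (both assumed normalized to [0, 1]); alpha tunes the mix."""
    return {doc_id: alpha * vector_scores.get(doc_id, 0.0)
                    + (1 - alpha) * bm25_scores.get(doc_id, 0.0)
            for doc_id in set(vector_scores) | set(bm25_scores)}

def search_with_rerank(query, vector_scores, bm25_scores, documents, k=10, shortlist=25):
    # Step 1: hybrid fusion gives a shortlist of, say, 25 candidates.
    fused = hybrid_scores(vector_scores, bm25_scores)
    candidates = sorted(fused, key=fused.get, reverse=True)[:shortlist]
    # Step 2: a cross-encoder scores (query, document) pairs one by one and re-ranks them.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, documents[doc_id]) for doc_id in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in reranked[:k]]

# Example usage (scores and texts are placeholders):
# docs = {"a": "first document text", "b": "second document text"}
# top = search_with_rerank("my query", {"a": 0.8, "b": 0.4}, {"b": 0.7}, docs, k=1)
```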

Nicolay Gerold: Yes.

Do you already support searching across vectors in a multi-vector setup, searching with one vector in another vector's index? If I, for example, have two vectors in a multi-vector setup, like one for the entire document and one for a summary, can I use the summary vector to search in the document vectors?

Zain Hasan: That's a good question. Before I answer it, the assumption is that the machine learning model used to generate the summary vector and the chunk vectors is exactly the same. So, we just added multi vector support, and I'm not too sure whether it's possible to search... oh, actually, you can.

So now that I'm thinking about it.

So let's say you configure the database with two named vectors: you've got summary vectors, and you've got chunks from the original document. You've got those two indices. Now it's going to go in and build up one HNSW graph of summaries and one of chunks that you've extracted from the document.

And now let's say you come in with a summary. If you go to the first named vector, the summary index, you'll get back other summaries. But because these two indices are vectorized using the same embedding model, you can take that summary, vectorize it with the same module, and search the chunk index with it, and this allows you to do the summary-to-chunk search.

And the other way around: if you have chunks and you want to say, okay, what summary might this chunk have come from, you just take the chunk, pass it in as a query to the first named vector, and you'll get back summaries.

Yeah. Initially I was confused, because I wasn't sure whether that was a database-level thing. On the database side, the only thing that you need to have is multi vector support; the query itself can be routed to one named vector or the other.

Nicolay Gerold: Yeah. And this opens up way more recommendation system use cases, but also really fun RAG use cases, like the adaptive one you mentioned before, where you can actually escalate to different representations. Nice.

How do you actually see vector databases changing the way we store, process, and interact with data over the next three to five years, horizontally?

Zain Hasan: Yeah, vector databases are pretty much the commercialization and productization of representation learning, right?

Initially, when I started at Weaviate, I
didn't know what vector databases were.

And then I started learning, and I realized that vector databases are just representation learning scaled up and made available, so that any company can use representation learning to do search and retrieval.

And so this is super powerful, because machine learning models in PyTorch, or from papers, are very inaccessible to a startup, or even to companies that don't have machine learning experience.

And where I feel vector databases come in and save the day is that they abstract away all of these different things, right? You can make smart default choices about the HNSW index parameters, what type of index you want to build, and what machine learning model you want to use initially.

Like, you can just go ahead with a default one and see how well it works. Of course, getting your feet wet, fine tuning the embedding model, and doing all of that will improve the accuracy, but the defaults help when you want to go from zero to one, from someone not using vector search to using vector search.

I think the way vector databases will change the landscape is this: right now, when people want to search, they think in a very computer-like manner about how the search needs to happen. We look for words, or we look for matching patterns or something like that.

Vector databases allow you to take all
the advances that are happening in machine

learning and now just simply turn a
switch and use them for your application.

I think this is a really exciting thing.

Okay, you mentioned recommendation systems. I can go from searching for recommendations over product descriptions to combining that, in a multi vector world, with recommendations over the images and how those products look.

I don't need to be a machine
learning expert to do that now.

In fact, we have an entire kind of
section of our team that are working on

educating JavaScript developers and web
developers on how to use vector databases.

Can you imagine? You don't need to have any machine learning expertise to use it, right? You don't even need to learn Python now to access all this machine learning knowledge.

And I think this is really the
power of vector databases, right?

I think it lives outside the machine learning world, where everybody can now access really smart, human-level search for their applications.

And so in five years, I have no idea where the field is going, but in the next year or two, seeing people take vector search and use it for their web apps, I think that's going to be very...
Nicolay Gerold: What are maybe some of the most interesting, but also weirdest, use cases you have seen implemented on top of Weaviate?

Zain Hasan: Yeah, that's a good question.

So there is one particular application where this person I spoke with was looking at palm images.

Because palmistry is an art and people are really interested in it, they were trying to embed images of people's palms.
And they were trying to see if they could, given your palm, situate it among other people's palms and, zooming out, see some mosaic of, oh, these palms end up rich, or these palms end up living a long life.

If you're asking about the weirdest thing, this is, I think, probably the most interesting and weirdest application. And it goes back to discovering high-level distributions, like semantics in data: people think that there's something there, so let's embed it and see if there is actually some structure in the way our palms look.

So I think that

Nicolay Gerold: What would be a use case you would love to see implemented on top of Weaviate?

Zain Hasan: Yeah.

I think there's a lot of potential with recommender systems. A lot of people are going really heavily into RAG, and it's understandable, because it's a very exciting space, but I think recommender systems can definitely be revolutionized by this as well. If you combine recommender systems with multimodality, I think that completely changes how recommendations can be made, right?

Like if you think about how
recommendations are made right

now, typically, you're looking
at product descriptions and how

they interact with products that
you've bought or interacted with.

So if the description is similar, then
you'll be recommended those things, right?

Tomorrow, what might happen, and some companies are already doing this, Meta has published on this with Facebook Marketplace, and Amazon has already published on it too, is multimodal recommender systems, where they generate vectors which are combinations of images and descriptions, text and images, then they tune these models, and then they make recommendations based not just on how a product is described, but also on the way the product looks.

And this allows you to recommend within a category much more accurately.

So right now, if you search on Amazon or Facebook Marketplace for a couch with a very specific description, you might get other couches, but ones that don't look the same.

But if you now have image as a
modality to recommend off of,

now you can differentiate between
different types of couches, right?

Because the description might be the same,
but the way they look might be different.

And so now you can more accurately identify what a user likes and doesn't like, because now you have another sense, the sense of sight, to do that with.

And now, if you extrapolate that into the future, there's a company that spun out of Google called Osmo that's working on digitizing smell as well, right? This is not publicly available right now, but they've published some papers on it.

If you can embed smell, and you can already embed video, then you can take all of these modalities. And a lot of these companies already have their products in these modalities, right?

Their marketing teams know what a Coke sounds like when you pour it, or what a burger sounds like when somebody eats it, because those marketing assets are already there; they're using them in their advertisements.

Now, maybe like two, three years
into the future, you could search

and say, This is the perfect burger.

This is the crunchiness of the burger.

This is the perfect perfume because
I know it smells exactly like

this old kind of memory from my
childhood or something like that.

Right?

So now you've got recommendations on those other sensory inputs, which is kind of impossible to imagine right now. I think that's what vector search can enable in this field.

There are some companies that we're talking to and working with right now that are taking advantage of this multimodal recommender system stuff.

I think that will only improve.

And I think three, four years from now,
we'll be amazed by the recommendations

and what senses they're coming from.

I think that

Nicolay Gerold: Yeah, and especially in recommendation systems, you can even extend it to the user side as well. So one use case we are exploring at the moment is basically for skincare: you have images of the user's conditions, you have the images of the creams and the products, you have the descriptions of the products, you have the ingredients, you have the description of the user's condition, and also the description of the user. And through that you can do so many different interesting things in terms of what you're actually matching on.

But also the reverse: if I can't find anything that matches, that already points to there being an interesting anomaly, and I might even hint that you should go to a doctor and actually talk to them.

Zain Hasan: For sure. One other interesting thing that I was thinking about for this: a lot of these big corporations want to know consumer buying patterns, right? If you're buying product X versus product Y, why did you buy X and not Y?

If you think about this multimodal search a little bit more, thinking about which senses a person is using to buy shoes versus which senses they're using to buy a soft drink, this is completely academic right now, but maybe in the future, if you have data around which products were purchased and which modality was more important for that purchase, you can do some testing to see: okay, when we do soft drink recommendations based on how products look or how they sound, people tend to convert and buy, versus if we just do it based on descriptions or nutritional facts, people don't buy.

So now you can even understand which
products are bought using which senses.

Like I buy burgers because of the way
they taste, not because of how they're

described or something like that, right?

But clothes I buy because
of the way they look, right?

I don't really taste the clothes, so that sense is not useful there.

So I think you can break down user intent across these modalities down to a science, which is kind of scary, but also interesting. People are already used to the fact that if they look at a YouTube ad ten seconds longer than average, all of their ads change into that type of ad. Now you have to be wary of how you're using your other senses as well, because if that sense can be captured and embedded into a vector, that's potentially a way to be recommended an item.

Nicolay Gerold: Yeah, we finally made it. The tech geeks can actually quantify psychology.

Zain Hasan: Exactly.

Nicolay Gerold: What is next for Weaviate? What can you already tease a little bit?

Zain Hasan: Yeah.

So one thing that we're working on right now, actually, we've got a release coming out where we're going to change how no-downtime upgrades work. We're releasing Raft; by the time this comes out, it should already be live, if there aren't any big surprises. But that's a big thing.

I think the multi vector support was rolled out in the last update, and people are now starting to use it.

And I think the quantization support, right? At the beginning of this year, we had one way to quantize, and within four or five months we're already supporting four different quantization techniques.

So we're going really heavy into making vector search affordable, so that it's not even a second thought: if you want to do search, you're going to be using vector search.

So we're investing a
lot into that as well.

I think those are the main frontiers
on the vector database technology

that people should look out for.

The other thing that we're doing, because we're bringing everything that's AI related closer to the database, is making it really easy for people to go from zero to one, right? If you're thinking about using a vector database, you don't need to learn how embedding models work; you don't need to learn about all of that. You should just be able to write a few lines of code to try this thing out, build a POC, and then we help you scale it.

So I think that those are
the main things that we're

Nicolay Gerold: What technology
are you most excited about outside

of vector databases and LLMs?

Zain Hasan: Yeah.

This is a good question, because literally the only things that I think about are search and generation, like LLMs. What other technology... yeah.

So my background originally is in biomedical engineering.

Brain computer interfaces are a technology
that I am super, super excited about.

I don't know a lot about gene editing and CRISPR.

If I knew more, I'd probably be
more excited about that field.

But I think brain computer
interfaces are very, very exciting.

There was a paper that came out where, just like we have language foundation models, they built an EEG foundation model.

They took the signals from your brain,
the electrical signals from your

brain, and they trained a foundation
model to understand those signals.

And I think if you take software like that, foundation models that can understand the electrical activations in your brain, and then you combine that with hardware, whether it's non-invasive (there are a lot of non-invasive headbands that are trying to measure EEG signals) or, of course, Elon's really invasive device that goes into your brain, and if you take representation learning techniques that allow you to decode brain signals, that, I think, is super exciting.

That is kind of related to vector
databases, but I think it's sufficiently

different because representation
learning is in everything.

I don't think you can get away
from that, but BCIs are super

Nicolay Gerold: And there's also an interesting development: there was a DeepMind paper on predicting weather with a graph structure overlaid over the longitude and latitude lines, and they also used an autoregressive model to predict the changes in state. I think this is a principle you could also apply to the brain in a really interesting way, if you have the data to actually understand it. And this is also a form of embedding, because you're mapping onto a smaller structure, which gives you better features to work with.

If people want to start building
the stuff we just talked about,

where would you point them to?

Zain Hasan: Yeah. So first, to get started, you can go to weaviate.io and have a look at what we're building. There's a lot of education material: there are blogs, there's the Weaviate podcast, and everything that we do at Weaviate is open source. If you want to get started with RAG, I would suggest people go to Verba.

It's an open source GitHub project
that we're building and maintaining

and upgrading at Weaviate.

It's probably the easiest way to go from having no RAG application to having a RAG application POC within a day. You can fork Verba and change a lot of the things that are modifiable.

You can change the embedding model, you
can point it to a GitHub repository,

you can embed all of your data.

It takes care of a lot of the things for you, it automatically chunks, and it's almost like a production-ready app that is customizable to your data.

If you're interested in that,
I would get started with Verba.

If you're interested in learning about
multi modality, if you're interested in

working with vector databases, we also
put out a lot of educational content.

So we've got a couple of courses on LinkedIn that you can look into that teach you about vector databases.

We also partnered with deep
learning AI to put out courses.

So if you search for Weaviate
deep learning AI courses, that

will also get you started.

And also Weaviate Academy, I
think, is a great resource.

So if you just want a 101 on vector databases, I would go there, and we're always adding new courses and new material.

We just added a new module on multi
vector and how to use that for recommender

systems for movie recommendations.

So I think those are some of the resources. And then for more interaction, if you want to reach out to me or to anybody at Weaviate, you can join the Slack community or the forum and ask questions. We're very active there as well.

Nicolay Gerold: So what can we take away?

First of all, maybe to double down on floating point precision: floating point is the most common way to handle numbers within AI, so most model weights are stored in floating point precision.

Ergo, the vectors, the embeddings we
produce with AI models, through embedding

models or through different types of
models like autoencoders, usually are

in floating point precision as well.

And within floating point precision,
you have a bunch of different options.

The most common one is FP32, which is also called single precision, and it uses 32 bits of memory per number. It has very good accuracy, decent speed, and high memory consumption.

FP16 uses only half of that, so 16 bits per number. It's much faster for calculations, and the loss of accuracy, especially in AI, is really minuscule. So FP16 is really common to use in AI nowadays, even when you're in the cloud with bigger GPUs, just because the sacrifice in accuracy is so small.

And when it comes to vectors, you can do even more, like what we talked about with binary quantization: you basically turn each number into a zero or a one.

And this works for vectors because they're usually very high dimensional, so it still splits the space in a lot of different directions and you still keep a decent amount of accuracy. And this cuts the memory consumption of the vectors by roughly 97 percent, since you keep one bit instead of 32 bits per dimension.

And that's it.

The other two parts of quantization we
talked about, so basically the scalar

one and the product quantization,
they're a little bit more complex.

The scalar one is a middle ground in complexity, I would say, between binary and product quantization. It usually uses 8 bits, so it keeps more precision, but it also saves less.

So it's basically a trade-off: memory consumption goes down, speed goes up, but accuracy also goes down. Whereas binary quantization really optimizes for speed and memory, with a decent amount of accuracy lost as well.

So you really want to pick,
okay, what matters most to you.

Product quantization, on the
other hand, does way more.

It splits up vectors into chunks,
groups similar chunks together, and

then stores IDs instead of full numbers.

And this is especially good when binary quantization fails; then you probably want to reach for something like product quantization. It's more complex, but it's really flexible, and you can cut your costs by a lot just by quantizing. Basically, you try to hit the amount of memory or the amount of speed you actually need, and then try to squeeze as much accuracy into it as possible.

And this really guides you into the main
trade off, like cost versus quality.

Your quality is basically the accuracy
you get and the costs are mostly based

on the memory or based on the speed.

When you're quantizing, the memory consumption, and of course the storage, is less. In terms of speed, you could use FP32 and reach the same speed as with binary quantization just by throwing more compute at it. So the cost-quality trade-off is the main one you have to consider.

And in that, you basically want to optimize for your use case and pick the smallest trade offs.

If you have really small data, you
probably can just use regular vectors.

If you have a large amount of
vectors, you actually have to

start thinking about quantization.

And then you basically have to think through: okay, depending on the use case and what I actually want to optimize for, which type of quantization should I opt for? Or, if accuracy really is the most important driver and you don't care about the cost, you don't have to quantize at all.

Always, when you make your choice, first determine what you're optimizing for, and then test against your needs. Try a few options first. Try, for example, binary quantization and see whether you can reach the desired level of accuracy. If it works, great: you've got binary quantization, it's the cheapest method, your accuracy is good enough, and you can go with that. If not, move on to the other methods and try them out.

An important part within that is to always watch your data closely. Check the spreads, check the distributions, and also check the distance results you're actually getting in your vector search, because this will tell you a lot. Binary works best if you have a normal distribution, and in the real world we often don't really have that, so you might have weird spreads, which you actually have to monitor.

Also, watch your budget. Start cheap, and then you can still add quality later where it matters. But when you want to add quality, you don't need to use FP32 or really high precision; you could also think about adding a re-ranker after retrieval and overfetching with binary quantization. In the end, it really depends on how you want to architect your system.
I think what's a really interesting
aspect in that is actually quantization

enables you to do a lot more by actually
making the storage so cheap that

you can actually store more vectors.

I think that's a pattern that we
see a lot in technology when you

make something cheaper or easier.

We actually see a lot more of it.

And when the vectors get really
cheap, we can store multiple

representations of the same data.

For example, if I have a text, I might embed the full text, I might embed a summary, I might embed multiple vectors for each document and store them as a multi vector associated with one document. I might embed a question the document is answering, and then I can run retrieval over the different vectors and search through those instead of just one.

And this will actually allow me to
do a lot of cool stuff, which is just

enabled through the really cheap storage.
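A small sketch of storing several representations per document and searching across them. The `embed` function here is only a stand-in for whatever embedding model you actually use, and the document fields are placeholders:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a real embedding model call (deterministic dummy vectors)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384).astype(np.float32)
    return v / np.linalg.norm(v)

# Several representations per document: full text, summary, and a question it answers.
documents = {
    "doc-1": {
        "full_text": "full text of document one",
        "summary": "short summary of document one",
        "question": "a question document one answers",
    }
}

index = {
    doc_id: {name: embed(text) for name, text in fields.items()}
    for doc_id, fields in documents.items()
}

def search(query: str, k: int = 5):
    """Score every representation and keep the best-matching one per document."""
    q = embed(query)
    scores = {
        doc_id: max(float(vec @ q) for vec in reps.values())
        for doc_id, reps in index.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(search("example query"))
```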

And yeah that's it for this week.

We will continue with Search next week.

So stay tuned for that.

We will probably move into a different
direction, move into Knowledge

Graphs to close out the season.

And stay tuned for that.

If you like the episodes, let me know.

Also, leave a like on YouTube,
leave a review on Spotify or Apple.

And otherwise, I will catch
you next week, so see ya.
