Nicolay Gerold: SQLite powers
nearly every app on your phone.
It's a database in your messaging
app, in your calendar, but also in the
browser and countless other applications
which are running on your devices.
It's bundled into Python, and in Ruby or Node, for example, it's only one install away.
And its magic is in its simplicity.
It's one file to backup,
to move, or to version.
There is no complex deployment.
There are also no state
management headaches.
But up until recently, search in SQLite was restricted to keywords. Now we can build semantic search and hybrid search right in the local database.
And today we are talking to Alex
Garcia, who is the creator of SQLiteVec.
And we talk about how SQLiteVec works,
the optimal use cases, and how you
can optimize it for your use case.
And instead of basically having a massive central search database, each device can maintain its own search index, and the computation happens on the user's hardware. So results are near instantaneous with no network delay, and private data never leaves the device, which is probably the main motivator for why you want to do this. And this really shines for end-consumer applications, or for privacy-aware or privacy-sensitive applications like in medical or healthcare.
Let's do it
Alex Garcia: It's row-oriented storage, so it doesn't have good compression. Some of the other benefits that you get from column-oriented stuff, you don't get with SQLite. And also, SQLite is divvied up into pages of data, and that's how things are written and stored. Most of the time that doesn't matter for applications, but for SQLiteVec it actually does, because, and I guess we'll talk about this later, SQLiteVec stores all vectors in big chunks, giant blobs that are a couple of megabytes in size.
And the problem with that is since
page sizes in SQLite are like
four kilobytes each, that means on
disk, it's stored in like several
different pages all throughout.
It's not like in a
contiguous block of memory.
So there are small quirks like that that make some analytical stuff a little bit hard to do in SQLite. But the benefit is that if you're doing transactional stuff, say an application that has many writes, it handles that quite easily and it can do writes a lot faster. So that's usually the main difference versus DuckDB or LanceDB for that kind of stuff: when you're writing an application, SQLite is usually faster, especially for writes. But it is a little bit weird.
The documentation for SQLite and its file format storage is quite verbose. Every single byte in the header page and all the other pages that exist is documented, and there are a few tools in the command line interface where you can view every page and see what data is in there: B-trees, leaf nodes, all that stuff. So it's a little bit weird, but since it was built for transactional stuff, it makes a little bit more sense.
Yeah.
Nicolay Gerold: Yeah, so why did you decide to build SQLite Vec on SQLite specifically and not choose one of the other data formats? You could have done it in Wasm, or could have done it on LanceDB as well, on Lance or even Parquet.
Alex Garcia: Yeah.
I think I use SQLite a lot for other projects, like data analysis stuff and pipeline work, and even small little applications. So I was already using SQLite a lot for that kind of stuff, and building something for it made a lot of sense. And I think SQLite, compared to a lot of other database technologies out there, is fairly small and very lightweight. It's very easy to build extensions, whether that's scalar SQL functions or table functions or virtual tables and all that stuff. The documentation is quite nice, it's been around for a while, and it's quite stable. When I started SQLiteVec, it wasn't like, oh, I want to build vector search, let me find a database for it. It was much more like, I'm a SQLite guy, I use SQLite for a lot of projects, and I want an easier vector search thing that I don't have to install 10,000 dependencies to use. I just want something lightweight that works in my current workflow. So that's how I stumbled into SQLiteVec specifically.
I use SQLite most of the time just
because of how easy it is to use,
it's already bundled into Python.
For most other programming
languages like Node or Ruby,
it's just like one install away.
So it's fairly easy to set up as opposed to some other databases that are a little bit larger and require way more setup commands or a separate server or whatever. So that's most of the reason why I was already in the SQLite world beforehand.
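As a minimal sketch of how little setup that means in practice, here is roughly what loading the extension from Python looks like, assuming the sqlite-vec Python package and its load() helper behave as documented (and a Python build that allows extension loading):

```python
import sqlite3
import sqlite_vec  # pip install sqlite-vec

db = sqlite3.connect("my-app.db")
db.enable_load_extension(True)   # requires a Python build with extension loading enabled
sqlite_vec.load(db)              # registers the vec0 virtual table and vec_* functions
db.enable_load_extension(False)

print(db.execute("SELECT vec_version()").fetchone())
```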
Nicolay Gerold: Yeah, and what I love about SQLite as well, especially in the past few months, is that you can just have a local copy of the database and experiment, which just isn't there for most of the other databases.
Alex Garcia: Yeah.
And you can just, it's
a single file, right?
So you can like copy and paste
it if you want to make a backup.
That's not like transactionally
safe, but you can do it.
It's like completely fine.
So it's fairly easy to run a
quick experiment, destroy it if
you have to, back it up by just
like uploading the DB file to like
Google Drive or S3 or whatever.
So yeah, that's definitely another main
reason why like I usually reach for
SQLite just cause it's so easy to set up,
don't have to do much, just single file.
Very simple.
Nicolay Gerold: Yeah.
What else, what are some of the use cases or some of the fun stuff you've tried out with SQLiteVec, especially with local search and local AI on top of SQLite?
Alex Garcia: I think most of my initial projects with SQLiteVec have just been a better search engine, right? SQLite already has full text search, and I've already built a lot of applications in the past, whether it's personal stuff or for clients, that are just a straight-up search engine, keyword search, all that stuff. But of course it's just keyword search, right? There's no semantic search, no fuzzy matching, no deriving the semantic meaning out of it or whatever. So when I built SQLiteVec, that was the main thing I would do: if I had a pre-existing search engine, add vector search to it, and then you get all these cool new results from it. That's been my main focus, just because of how easy it is, and how well it integrates with the pre-existing SQLite full text search extension. So that's been like my main focus.
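One common way to combine the existing full text search results with vector results is reciprocal rank fusion. A minimal sketch, where the docs_fts FTS5 table, the vec_docs vec0 table, and the k = ? constraint syntax are assumptions for illustration rather than the speaker's exact setup:

```python
def hybrid_search(db, query_text, query_embedding, k=10, rrf_k=60):
    # keyword candidates from FTS5
    keyword = db.execute(
        "SELECT rowid FROM docs_fts WHERE docs_fts MATCH ? LIMIT ?",
        (query_text, k),
    ).fetchall()
    # semantic candidates from the sqlite-vec virtual table
    semantic = db.execute(
        """SELECT rowid FROM vec_docs
           WHERE embedding MATCH ? AND k = ?
           ORDER BY distance""",
        (query_embedding, k),
    ).fetchall()

    # reciprocal rank fusion: each list contributes 1 / (rrf_k + rank)
    scores = {}
    for results in (keyword, semantic):
        for rank, (rowid,) in enumerate(results, start=1):
            scores[rowid] = scores.get(rowid, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```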
I think there's other fun little
projects that I've done where it's
building a small little classifier,
like a text classifier on top of
the vector search and embeddings.
Again, very easy to use in SQLite
just because if your data is
already in SQLite, it's very easy.
And if it's not, it's very easy to
get your data into a SQLite database.
And once it's there, you embed those and build a little classifier on it: get the K nearest neighbors and then find the most common label among the closest 100 vectors or whatever. That's been my focus so far.
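A hedged sketch of that K-nearest-neighbor classifier idea, taking the majority label of the 100 closest vectors; the items and vec_items tables and the k = ? constraint are hypothetical names, not the speaker's actual schema:

```python
def knn_classify(db, query_embedding, k=100):
    row = db.execute(
        """SELECT items.label, COUNT(*) AS votes
           FROM (
             SELECT rowid FROM vec_items
             WHERE embedding MATCH ? AND k = ?
             ORDER BY distance
           ) AS knn
           JOIN items ON items.id = knn.rowid
           GROUP BY items.label
           ORDER BY votes DESC
           LIMIT 1""",
        (query_embedding, k),
    ).fetchone()
    return row[0] if row else None
```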
I think it's also because I have a background in data analysis and data engineering kinds of tasks. So that's usually what I go for: whenever I'm using SQLite, it's usually for some data analysis project, so building a search engine or building a small little classifier makes a lot of sense. I occasionally do, but my main focus isn't in building AI applications, right?
I've seen and heard other people talk about building a command line interface that uses SQLiteVec to store embeddings, right? Because of how easy it is, how it's right there on your computer. Or maybe it's a little web server that does some sort of embedding plus storing in SQLiteVec and stuff like that. So on the transactional application side, I've done small little projects here and there, but my main focus has always been: build a search engine, build a classifier, do some recommendation-related stuff with SQLiteVec. That's been mainly my focus there.
Nicolay Gerold: Yeah.
And where are those actually embedded? Are they running in web applications in the front end, are they running in the backend, embedded in one of the Cloud Runs or Cloud Functions?
Alex Garcia: So I think initially, when SQLiteVec first came out, since it was a developer tool, people would just run it on their laptops, right? So a local development server on your Mac or Windows machine or whatever, running in a command line interface, or a little web app that you have. I've heard of people running it on cloud applications as well, so they deployed it to AWS or Google Cloud or Cloud Run or any of those places.
And they're either storing or
doing like vector comparisons
with SQLiteVec on those platforms.
So it is totally possible to use SQLiteVec in desktop applications as well, and mobile applications.
I recently added support and I will be
adding better support for Android and iOS.
So you could theoretically run
SQLiteVec and store vectors in a
SQLite database on your Android phone
or on an iOS application and such.
It runs on a Raspberry Pi.
Actually I have a little device here.
I have a little this
is a Raspberry Pi Zero.
It doesn't need Wi Fi, but it has a
little like keyboard and e ink display.
It's called a Beepy or a Beatberry
and this can run SQLiteVec.
It's just a Raspberry Pi.
So you could like SSH into
it, Compile or download a copy
of SQLiteVec, run it on here.
And you can have those embeddings run here, so you could generate embeddings on here, generate embeddings on the server, or wherever else you run it. My main focus was: get SQLiteVec running everywhere, right? And it does run everywhere, including WebAssembly in the browser, mobile devices, all of that. And I've heard anecdotally of people running it on all these platforms as well, in either test applications or just trying things out, mostly because SQLiteVec is also fairly new; it just came out a few months ago.
Nicolay Gerold: yeah, I think it's
very interesting, especially with the
entire local first movement happening,
which got so many endorsements
from, like, all over the place,
which I would have never expected.
And I think the local AI part
is something I'm seeing more and
more, and we are implementing more
and more, and I think SQLite is a
very interesting choice for that.
I will be on the lookout for whether Apple integrates something like SQLiteVec in their Messages or Contacts.
Alex Garcia: Yeah, I know that Apple, with their Apple Intelligence stuff, on their slides they say "semantic index," which is an on-device vector index of some sort. They probably don't use SQLiteVec. I know that Microsoft has a Recall AI feature that would take screenshots of your computer every so often, and you could talk to your computer, like, oh, what was I doing on this website before? Which, first off, had a ton of privacy and security implications that were just absolutely terrible.
But one cool thing that came out of that was that Microsoft was actually storing their vectors inside of a SQLite database on your computer. So it wasn't syncing to the cloud or anything. They used their own custom implementation; they didn't use SQLiteVec just 'cause it didn't exist back then. But they did have some form of DiskANN support that was built on SQLite, where the nodes and edges were stored in a table. It was a complicated thing.
But I think that showed that the technology stands up, right? If these companies are building local AI stuff and they have all these resources and they still end up storing their vectors in SQLite, then this is a pretty good approach and it'll probably work in other places as well.
Yeah.
Nicolay Gerold: Yeah.
What I'm really interested in is how far you can push it, like what the limits are, especially of search speed but also of index size, especially when you get into the larger embeddings, like 1,000, 2,000, even 4,000 dimensions.
Alex Garcia: Yeah, I think first off, for the limits of SQLite in general, they say you can store terabytes of data within SQLite before it actually hits its theoretical limit. But I think the actual practical limits are much, much smaller. For SQLiteVec specifically: right now, SQLiteVec is just brute force search. It's just linear search; it compares every single vector inside. There's no approximate nearest neighbors index yet; it just does a brute force scan across everything. I mainly did that because, one, it was much easier.
one, it was much easier.
Two, SQLite is very the storage format for
SQLite is very specific, and some of the
ANN indexes that are out there just may
not work with how SQLiteVec stores data
within B trees and pages and all that.
So I didn't really want to choose
one where if it, I didn't really
want to choose an ANN index too
early and figure out it was wrong
and then have to change things later.
So I just held off on that at first, but
I think practically the the practical
the practical limits for SQLiteVec,
it's probably going to be like from
what I've seen, you could do tens of
thousands or hundreds of thousands
of vectors where the dimensions are
like, 768 or 1,024, like you could
probably like store tens of thousands or
hundreds of thousands of vectors before
it get, it takes too long to search.
I think what I tried was 768-dimensional vectors. I could probably store like half a million of those inside of a SQLite database, and the search would take around 500 milliseconds or so. In general, I'd like searches, especially most SQL queries, to take less than a hundred milliseconds to execute, because there's a lot of other overhead, and that's typically around where people can really tolerate it, or before they even notice that results are quote-unquote slow.
So I would say tens of thousands to hundreds of thousands. And if you have larger dimensions, like 2,000 or 3,000 dimensions, then probably even smaller; the hundred-thousand range would probably be a little bit too much, at least for this initial full-scan, linear, brute force search that SQLiteVec does currently. If you can tolerate it, SQLiteVec does have quantization support. So if you want to do binary quantization, or scalar quantization, or Matryoshka embeddings, where you cut them in half or truncate them to a smaller dimension, SQLiteVec does have support for that.
And with that, you could probably add a little bit more, right? With binary quantization, it's 32 times smaller, so you could probably do a million binary vectors in SQLiteVec before you notice some performance issues there.
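A hedged sketch of what that binary quantization path might look like, assuming the vec_quantize_binary() function and the bit[] column type behave as documented; the table name and the stand-in vector are made up for illustration:

```python
import sqlite3
import struct
import sqlite_vec

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)

# a bit-vector column: 768 bits instead of 768 float32s (32x smaller)
db.execute("CREATE VIRTUAL TABLE vec_bits USING vec0(embedding bit[768])")

vector = struct.pack("768f", *([0.1] * 768))  # stand-in float32 vector
db.execute(
    "INSERT INTO vec_bits(rowid, embedding) VALUES (?, vec_quantize_binary(?))",
    (1, vector),
)

# KNN over the quantized vectors; the query is quantized the same way
rows = db.execute(
    """SELECT rowid, distance FROM vec_bits
       WHERE embedding MATCH vec_quantize_binary(?) AND k = 10
       ORDER BY distance""",
    (vector,),
).fetchall()
```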
So that's typically around where it is currently. I want to get it to a better place. I think with a proper ANN index, which will be a part of SQLiteVec in the near future, I can imagine it doing a million, maybe a few million vectors before you really hit limits where SQLite and SQLiteVec wouldn't make sense.
I think another thing, it
would be metadata filtering and
partitions, which will also be
included in SQLiteVec very soon.
I think that could help
it scale even more, right?
Because even if you have 10 million vectors, you might only be searching a subset of those. Say you have a thousand users, and all those users have a thousand embeddings each. That's a million vectors, which might be too much to brute force search. But if you only search for one user at a time, you only search through their thousand embeddings. Searches can be a lot faster and a lot more tolerable. And I'll be including features that support that, including partitioning data and all that stuff, and metadata filters, relatively soon in SQLiteVec as well.
So all that to say: currently, tens of thousands, maybe hundreds of thousands of vectors. If you do binary quantization, maybe a million. But with some features that are coming down the line for SQLiteVec, I imagine it could support low millions, maybe 10 million vectors.
Nicolay Gerold: Shouldn't you already be able to do metadata filtering by just filtering down on another column
Alex Garcia: Yeah.
Nicolay Gerold: and then
Alex Garcia: Yeah, you could. So there are two ways of storing vectors with SQLiteVec. One way is a manual way where you just store a JSON array of text or a blob of vector data inside of a regular column in a table, and that works just fine. There are SQL scalar functions where you can manually compute distances and then just do a SELECT ... ORDER BY distance LIMIT 10, and that's a KNN search. And that works just fine.
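A minimal sketch of that manual approach, with vectors in a regular column and a brute-force KNN via a scalar distance function; the documents table and its columns are hypothetical, and it assumes sqlite-vec's vec_distance_cosine() as documented, with the connection set up as in the earlier loading sketch:

```python
import struct

query = struct.pack("768f", *query_embedding)  # float32 blob, same format as stored

rows = db.execute(
    """SELECT id, title,
              vec_distance_cosine(embedding, ?) AS distance
       FROM documents
       WHERE published_at > '2024-01-01'   -- ordinary SQL filters still apply
       ORDER BY distance
       LIMIT 10""",
    (query,),
).fetchall()
```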
There are some performance implications with that. If you have very large vectors, say a 2,000-dimensional vector, that's going to be about 8 kilobytes of storage per vector, right? And if you're storing that alongside your other data in a regular SQLite table, where that data is stored in a row-oriented format, there could be performance implications, where the disk head has to scan through a lot more data because you're storing these large blobs within the table. So there are some performance implications if you're doing that.
So another way of storing vectors within SQLite is using a virtual table. We have a virtual table called vec0, and it's very similar to the virtual tables found in the SQLite full text search extension and the R-tree extension. It's CREATE VIRTUAL TABLE, and then there are some configuration options you can add. What that virtual table does is store vectors within your SQLite database inside of shadow tables, but it chunks the vectors into large blobs, like a big chunk of a thousand vectors each. And that makes searches a lot faster. It stores data completely separately from your other data, so it's a separate table and you have to do a join, which can be a little bit awkward at times, but it works just fine.
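A hedged sketch of that vec0 virtual table approach and the join back to the source data; the table names and the doc_id/embedding variables are placeholders, and the float[384] declaration and the k = 10 KNN constraint follow the sqlite-vec docs as I understand them:

```python
import struct

db.execute("CREATE VIRTUAL TABLE vec_docs USING vec0(embedding float[384])")

# rowids line up with the ids of a normal `documents` table
db.execute(
    "INSERT INTO vec_docs(rowid, embedding) VALUES (?, ?)",
    (doc_id, struct.pack("384f", *embedding)),
)

# KNN search in the virtual table, then join back for the actual content
rows = db.execute(
    """SELECT documents.id, documents.title, knn.distance
       FROM (
         SELECT rowid, distance FROM vec_docs
         WHERE embedding MATCH ? AND k = 10
         ORDER BY distance
       ) AS knn
       JOIN documents ON documents.id = knn.rowid""",
    (struct.pack("384f", *query_embedding),),
).fetchall()
```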
And with that, searches are much faster, and there are some other performance gains: because we store things in a contiguous block, we can read them into memory a lot faster, and we can do things like SIMD acceleration. So there are a lot of benefits to doing that.
But the problem with that virtual table is that there's currently no metadata filtering there. That's something I have to implement myself, and something I will implement. When metadata filtering is completed in SQLiteVec, on the vec0 virtual tables it will just be regular columns, right? So you can declare extra columns alongside your vector index, and you can do metadata filtering with WHERE clauses, just like WHERE submitted_at is between these dates, or WHERE this label equals red, or whatever. So that will come; it's just not a part of the virtual table implementation yet. But if you're doing things manually and storing the vector alongside regular columns, you could totally do that yourself: you can do normal SQL queries on top of that and be just fine.
Another thing is you could also do pre-filtering with both the manual implementation and the virtual table implementation, where you do another SQL query that just returns a list of IDs of items that you want to search. So let's say, for example, I get all the documents that this user has access to; then you can pass that allow-list into the query for the vec0 virtual tables, and it would only consider those when doing comparisons. So that is one way of doing it, but it could be a little bit slow, not as performant as some of the other ways that will be coming down the pipeline in a little bit.
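A hedged sketch of that allow-list pre-filter; the permissions table is hypothetical, and the rowid IN (...) constraint on the KNN query is my reading of the sqlite-vec docs, so the exact syntax may differ in your version:

```python
allowed = [row[0] for row in db.execute(
    "SELECT document_id FROM permissions WHERE user_id = ?", (user_id,)
)]
placeholders = ",".join("?" * len(allowed))  # assumes a non-empty allow-list

rows = db.execute(
    f"""SELECT rowid, distance FROM vec_docs
        WHERE embedding MATCH ?
          AND k = 10
          AND rowid IN ({placeholders})
        ORDER BY distance""",
    (query_blob, *allowed),
).fetchall()
```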
Nicolay Gerold: Yeah, that's very interesting. My mind instantly jumped to re-rankers as well, that you could actually support re-ranking, especially when you have local search applications or some form of recommendation system in the front end, which could make it way faster. Imagine an e-commerce site where the user is already navigating the page: you have some filters available while the user is navigating, and you have the search query. And I think this could be an interesting addition to SQLiteVec as well; adding some re-ranking capabilities could be very cool.
Alex Garcia: Real quick, do you mean a re-ranking model, or just re-ranking a binary quantized bit vector with the full-size vector?
Nicolay Gerold: No, re
Alex Garcia: Oh, a re-ranking model?
Okay, yeah.
Okay.
Nicolay Gerold: In the end, the re-ranker often integrates multiple different factors, so it's often combined with, for example, a score for the recency of the data or a score for past interactions. Popularity, for example, is a common score that you add to the different items, and you can basically add business logic to the re-ranking as well, so that you don't just have the query-document or query-item re-ranking score, but you add more of the business logic scores on top of that. So you can optimize for certain
Alex Garcia: Yeah.
I wasn't aware of that. I always thought that re-ranking models just re-rank based off the content itself, and I always thought that value was limited. But if you can include other metadata, like popularity and stuff, that's pretty cool. It sounds like it could be a main part of SQLiteVec. That'd be interesting.
Nicolay Gerold: Yeah, and there are two ways of implementing it in the end. One is you actually train it into the re-ranker by using the popular items in your training data set. But in the end, that biases it towards very popular items all the time, which is something you often don't want. Or you can basically just use it as a weighted score with a weighting function, which is often a little bit more controllable.
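A hedged sketch of that weighted-score approach: blend the vector similarity with business-logic signals such as recency and popularity. The weights, decay constants, and field names here are made up for illustration and would normally be tuned against an evaluation set:

```python
import math
import time

def rerank(candidates, w_sim=0.6, w_recency=0.2, w_popularity=0.2):
    now = time.time()

    def score(c):
        similarity = 1.0 - c["distance"]            # smaller distance = more similar
        age_days = (now - c["published_at"]) / 86400
        recency = math.exp(-age_days / 30)          # decay on a roughly 30-day scale
        popularity = math.log1p(c["clicks"]) / 10   # dampen heavy-tailed counts
        return w_sim * similarity + w_recency * recency + w_popularity * popularity

    return sorted(candidates, key=score, reverse=True)
```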
Alex Garcia: That's so cool.
Yeah.
Nicolay Gerold: Nice.
And I think for me, when I'm working with embeddings, I'm often fine tuning models and keeping them up to date, because you have some data drift in practice.
How could I approach that with SQLite Vec?
And especially when I'm assuming I
have it deployed on multiple different
devices or multiple different customers.
How can I actually keep the
embeddings basically up to date
with the most recent model?
Alex Garcia: That is a good question.
So it's a different embeddings model that you fine tune and train. Like, you insert some vectors and embed some stuff on March 1st, and then on March 14th you train a new model, and you want to go back and update the previous embeddings you had with the new model. Okay. Yeah, SQLiteVec is just SQLite at the end of the day, just pure SQL. So you could go back and update a table or a virtual table that has vectors in it with the new values that you have.
You could also use triggers. So for example, say you're embedding comments from a YouTube video: you have a column called comment, which is the actual text comment someone left on a video, and you're doing some vector search on top of that, so you're storing the embeddings of those comments in a separate virtual table with SQLiteVec, or just in another column as a JSON array or a blob representation of a vector. You could always use triggers, so whenever a comment is updated, a trigger goes and updates that virtual table, or updates that other column, with a new embedding representation of the new comment.
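A hedged sketch of that trigger idea. It assumes an embed() SQL function is available on the connection, for example one registered from Python with create_function (or provided by an extension like sqlite-lembed); the comments and vec_comments tables are hypothetical:

```python
# embed_text() stands in for whatever call returns a float32 blob for a string
db.create_function("embed", 1, lambda text: embed_text(text))

db.executescript("""
CREATE TRIGGER IF NOT EXISTS comments_embedding_refresh
AFTER UPDATE OF comment ON comments
BEGIN
  -- delete-and-insert keeps the vec0 shadow tables in sync with the new text
  DELETE FROM vec_comments WHERE rowid = NEW.id;
  INSERT INTO vec_comments(rowid, embedding)
    VALUES (NEW.id, embed(NEW.comment));
END;
""")
```

Note that a function registered from Python like this is only visible on connections that register it, so every writer would need to do so.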
And if you're syncing with a newly fine-tuned model, or if you change the entire embeddings model in general, I think an update like that is probably what's needed.
And as long as the dimensions match up
and all that, it should work as expected.
Don't know if I have a
good answer to that one.
Yeah.
Nicolay Gerold: Yeah, I think that's one of the most challenging parts, especially in the local-first era: the syncing between different devices, between the user and the backend, and how to actually make that possible without overwrites if you have multiple users working on the same data, which is a very
Alex Garcia: Yeah, there are a few tools and features that people have built that wrap SQLite to do some synchronization work, where you have multiple people with their own version of a SQLite database, and it eventually all gets synced back to a main, master copy. I don't know what the status of those tools is. Some of them can be quite complicated to build and they may not have a lot of funding, so I don't know exactly if there are any good options out there. And I guess I don't know how well they would work with SQLiteVec in general, because SQLiteVec uses virtual tables. You might be able to sync the underlying shadow tables where the data is actually stored, but I've just never tried that myself. I think there are a few good backup options for SQLite where it will continuously replicate what you're working on into S3 or somewhere else; there are replication tools out there specifically for SQLite.
SQLite does have a session extension. It's not necessarily like forking a SQLite database, but if you have your own copy of a SQLite database and you make edits to it, and then you want to sync those edits with someone else and have them merge together, and raise an error if there's a conflict or whatever, there is a session extension that does part of that. It's not very well used; I don't know how often it's used in real-world environments. But there are some ways of doing that and some ways of synchronizing. Since SQLiteVec uses the virtual table implementation, though, it might be a little bit hard to work it into some of those tools.
Nicolay Gerold: Yeah, I would really
love to try to build like a local only
recommendation system, which basically
collects the user data only in the
front end, adds it to a vector database,
gets the items, embeds them, and then
basically does the recommendation.
This would be a fun side project
Alex Garcia: Yeah, I usually get around synchronization issues by just not synchronizing and having it completely local, where you don't need to share with other people or do anything like that. Just try to make it more private if possible. That also sidesteps the entire issue. Maybe not the best approach, but yeah.
Nicolay Gerold: Yeah.
What I would love to hear: what are your go-to models when you're building the local stuff? Are you using custom models as well, or do you rather tend to go with something like Cohere's or OpenAI's?
Alex Garcia: I try to use open source models as much as I can, and I avoid OpenAI and Cohere, mostly because being able to run it myself, I find, can be a lot easier and a little bit more future proof, and you don't have to pay for it, which is great for cost and all. When I do choose an open source model, I typically choose a sentence-transformers model, mostly because those can be translated into a GGUF format for llama.cpp, which is usually the inference engine that I use for embeddings models, whether that's through Llamafile, which is a Mozilla AI project where you can run local models, including embeddings models, and generate embeddings through it. That takes GGUF model files, and I like those because they're much easier to run; it's just a single file, like SQLite. The inference engine itself, llama.cpp, is just pure C. It's very easy to compile and use in other projects. I also have another SQLite extension that's a sister project to SQLite Vec, called sqlite-lembed, which uses llama.cpp to generate text embeddings.
That's usually what I use as a go-to, because I'm already in SQLite land. A GGUF model is fine, and sentence-transformers does pretty well; you can convert a sentence-transformers model into GGUF format quite easily. I typically choose the really small one. I forget the name of it because it's a bunch of numbers and letters, it's like all-MiniLM-L6-v2 or something. That one's 30 megabytes and it quantizes very easily. The only problem with it is that the context size is fairly small, only a couple hundred text tokens, I believe. But I like it because it's very small, quick, and easy. And it was trained a while ago, so it doesn't know a lot of recent stuff.
So if I do want something that's a little bit more recent, that knows things about COVID-19 or anything like that, then, I have found, Salesforce has a few embeddings models recently, I think it's called Arctic Embed. No, I'm sorry, that's Snowflake; Snowflake has a few embeddings models called Arctic Embed. The recent version 1.5 is fairly nice. It does have support for Matryoshka embeddings and a few other nice features that come with it. It's a little bit larger and a little bit more difficult to run, just 'cause instead of a 30 megabyte file like the other one, this one's probably in the couple-hundred-megabyte range. But it's trained on more recent data. For example, I was working on a little project that does news article headlines: scrape the last couple thousand headlines from NBC News, an American news outlet, and do searches on top of those headlines. Those are fairly recent topics, and some older embeddings models may not know about them, but this Arctic one seems to know a lot about them, so it does pretty well.
I also try out Nomic, which has a few open source ones. Mixedbread has a really good embeddings model that does binary quantization very well. So whenever I can, I just try an open source model, because they're easy to run, they're free, and you don't have to worry too much about calling out to another API or whatever. I think another nice thing about using a local model is that it might run faster, not so much because it's running on your own hardware, but because there's no network delay, and you don't have to worry about API keys or sending a REST API request or whatever. So keeping it local makes that part a little bit easier, and it's less likely for there to be bugs or crashes or whatever else.
Nicolay Gerold: Yeah.
How have you experienced the system demands when you're building something, especially user facing, and when you have multiple different models running: one file for your transformer model for generating text, an LLM, and one embedding model, maybe even some stuff on top of it? Have you seen any interesting strategies for swapping different models in and out of memory, or anything interesting in how that is handled?
Alex Garcia: Not that I've seen. Typically the most I have running is, like you said, maybe one LLM if it's doing a RAG application, so one LLM-style model, maybe a Llama or something else that does text generation, and another model that's strictly for embeddings, whether that's Snowflake Arctic Embed or whatever. Typically the embedding models are fairly easy to run. They're probably less than a gigabyte, often even smaller than that, so they don't take up a lot of memory or RAM, and you don't have to worry too much about it. If I was running multiple models at the same time?
Yeah, I don't know if there are a lot of good strategies for swapping models in and out of memory and all that. I would just shut down the instance or have a separate server. One nice thing about doing local stuff is that there are a lot of other tools out there that make it easier to talk with other devices on your network. I develop on a 2019 MacBook, the old Intel processor. It's fairly old now, it doesn't have a lot of RAM, and I can't run Llama on it or anything. But I do have a newer Mac mini that sits on my desk all the time, and I could always SSH into that and run another model, right? Or I can use Tailscale to access my Mac mini if I'm out at a coffee shop or whatever. That could run Llama, and I could run the embeddings model on my computer, and the network delay may not be too bad, right? So if you're running stuff on device, you probably have multiple devices on your network; maybe you could use some of them to do a hybrid mixture of things. Or of course you could always pay someone like OpenAI or Anthropic to host a larger LLM for you if you wanted to do that. So there are ways of either using multiple devices or just using another server if you have multiple models running. But I don't know if there's a good option for swapping models in memory and stuff. Yeah.
Nicolay Gerold: Yeah, on the topic from before, the current models: I still have a strong preference for using models from before 2021, especially embedding models, because I'm still uncertain about the impact of all the AI-generated content on the web, which is now basically in all the different models. So I really liked all the BGE models, which are fairly good if you actually fine tune them as well, which is how you get the relevant information or knowledge into the model.
Alex Garcia: Yeah.
I think fine tuning is probably
required for a lot of applications.
I think when you're developing or doing prototypes, you don't really need that stuff all the time. But as soon as you have an actual use case for it and you have an evaluation data set and you need to get something out and done and actually in front of customers or users, then fine tuning is probably going to be required for whatever you do.
Yeah.
Nicolay Gerold: Yeah.
Anything beyond that in your stack? You mentioned Llamafile, and probably Transformers.js for all that sentence-transformers stuff when you're in the front end. What else is in your go-to tech stack?
Alex Garcia: Yeah.
There are a few developer-friendly tools that I use. Simon Willison has an LLM command line interface where you can interface with multiple different hosted LLMs, whether that's OpenAI or Anthropic or whatever, but also local models and local approaches, whether that's MLX or llama.cpp or PyTorch or whatever. I think that command line interface is a good way to try out different models, just to run things on the command line. For building applications themselves, I typically default to llama.cpp, mostly because it's easy to run on device and easy to compile and use, compared to things like PyTorch or TensorFlow, which can be, at least in my past experience, very difficult to compile or run on many platforms. So having something that's one binary and one file for the model itself with llama.cpp, that's usually my go-to.
Transformers.js is definitely a big one for working inside the browser. There are a lot of different embeddings models that run in the browser through Transformers.js that I use whenever I want to have a demo of SQLite running in the browser, or just semantic search in the browser. That's usually my go-to.
Some other projects: Llamafile is a big one. Llamafile also has Whisperfile as a new feature, where you can run Whisper as a single binary that runs on multiple operating systems, which makes my life a lot easier when working with audio. Yeah, that's all I can think of right now.
Nicolay Gerold: Yeah, I saw Whisperfile; it's really interesting. Have you played around yet with VLMs as well, like vision language models operating on images and other media?
Alex Garcia: Unfortunately not. I think the reason why is because llama.cpp doesn't have good support for vision models. They had initial support for one or two of them in the very beginning, but those slowly got deprecated or bit-rotted to the point where they don't really quite work anymore.
Which is quite disappointing, because in the last month or so there have been a lot of really cool vision models that have come out. To use them, you have to use PyTorch or a pre-configured Colab notebook, and I've avoided that just 'cause I've wanted to run things locally. But some of them look really cool. I think vision models are probably going to be the next big thing for the next few months, whether that's models specifically trained for OCR or handwritten text or image recognition and all that.
There are a few CLIP embeddings models that I've played with, and SQLite Vec of course works with them, because at the end of the day they're just embeddings you could store inside of any vector database. But yeah, I think there's going to be a lot more movement in that area, a lot more models, which I appreciate and love. I just hope they get easier to run, and easier to run on other platforms, because I think even Transformers.js may not have good support for them yet, just because inference on vision models is much different than on embeddings and text models, and it may take some time for them to really support that. But yeah, I just hope they get easier to run soon.
Nicolay Gerold: I would also expect that vision language models in general have to get bigger and bigger first before they can get smaller and better again.
Alex Garcia: Yeah.
Same thing as with text models, right? They got bigger and bigger, and now I think they've figured out some good strategies to slim them down, or better training data. So now things like GPT-4o mini are much smaller and faster to run, or some of the new Llama models with smaller parameter counts are better than the larger-parameter ones from even two years ago. So I think that little hump is coming soon: the hump that happened with text models a few years ago is probably going to happen soon with vision models as well,
Nicolay Gerold: yeah.
Have you actually benchmarked SQLite and SQLiteVec against the other, more common tools for local-first vector databases or local-first search, like, for example, Chroma, but also, if you're running on the server, Faiss, which tends to be my go-to?
Alex Garcia: I have run benchmarks. I don't know if I completely trust them, because benchmarks can be a very sensitive topic for a lot of people and a lot of other projects. But I'm confident in saying that SQLiteVec, with a few parameters tuned, is much faster at brute force vector search than a lot of other client libraries, with the exception of Faiss. Faiss is just so much faster than a lot of other ones; if you're doing brute force, Faiss is much faster than SQLiteVec. There might be one or two other C vector search libraries that are faster than SQLiteVec, at least for brute force, all in memory, all that stuff. Some other libraries out there aren't optimized for linear scans or brute force scans, because they care about ANN index speed and recall and all that. So in that case SQLiteVec isn't faster than any of those, because SQLiteVec is brute force, and any ANN index is going to be much faster than a brute force search. But I would say that in general, SQLiteVec is much smaller, easier to run, and easier to install. And also, inserting data into SQLiteVec is fairly straightforward, because again, it just stores vectors one after the other; it doesn't have to navigate through an HNSW graph or a DiskANN graph to insert stuff. So usually insertion is much faster, and usually KNN search queries are faster in SQLiteVec than in some of those other tools. Again, for brute force searches.
Nicolay Gerold: And also, I think an underappreciated aspect is that you have an actual data model, which often makes the iterations way faster, because you have to think about your data first and how it looks, and you can also easily run all the different filters, pre-filter and post-filter, with the additional columns you have in SQLite.
Alex Garcia: Yeah.
And also, one nice thing about SQLite and SQLite Vec is that it just does fewer things, so there are fewer things to slow it down. For example, the actual storing of data and writing to disk and all that, SQLite Vec just uses SQLite for, right? And SQLite has been optimized for very fast disk writes and reads for the past 20 years. Whereas some of these other vector databases or local-first vector tools out there had to implement everything from scratch: they wrote their own disk access, they have their own transactional stuff, they had to write their own database, right? SQLiteVec just uses a pre-existing database that's optimized for speed and quick writes and all that stuff. So that's one nice thing about SQLiteVec: because it's on SQLite, a lot of things just run much faster, because it's a proven technology. And we don't have to spend a lot of time deserializing or thinking about how we're going to store things on disk, mostly because SQLite already figured most of that out for us, if that makes sense.
Nicolay Gerold: Yeah.
And what would you say is actually missing from the space?
Alex Garcia: I think on the vector search and vector storage side, not much. In the next few years there might be some better ways to store vectors, or a better ANN index that someone comes up with that's faster or has better recall or whatever. So on that side I don't see much happening; again, there might be some new quantization techniques or new ways of storing query vectors. But I think the real innovation in this space is probably going to come on the embeddings models side. If those embeddings models become smaller, faster, easier to fine tune, and easier to use, that's going to have a way bigger impact on vector search, semantic search engines, and recommendation systems than anything the vector database can do itself. For example, if it's easier to run an embeddings model that is less than 100 megabytes, can run on any platform, is very up to date, and knows information from just a few months ago, I think that's going to have a way bigger impact on AI applications than anything a vector database can do, right? Because with a vector database, sure, a KNN search may get a little bit faster, and the recall rate will get better if you're using an ANN index or whatever. But the real innovation in this space is probably going to come from better, faster models that are easier to fine tune, or to use, or to restrict to your own use case. I think that's where most of the innovation is probably going to come from. Yeah.
Nicolay Gerold: Yeah.
I think, on the fine tuning part: I think fine tuning an embeddings model isn't that difficult. It's more that creating the data set is very hard, and I don't think that will go away.
Alex Garcia: And I think, on my side, as someone who's never done fine tuning and who has found all the guides a little bit intimidating, that's why I think, oh, it should be easier to fine tune. Whereas someone like you who has done fine tuning in the past knows how easy it is. So it just might be a skill issue on my side, where I've never tried actual fine tuning or anything.
Nicolay Gerold:
Everything's a skill issue,
Alex Garcia: And also, I think it would be cool to do fine tuning in the browser, right? Or fine tuning in different languages. 'Cause I imagine, and maybe I'm wrong on this, that fine tuning is usually, what, a Colab notebook that you spin up, that's only in Python, that does all this stuff. But if you could theoretically point a fine tuning command line interface at, or upload to a web interface, just the data that you have, with maybe some instructions, then we wouldn't have to use a Colab notebook or use Python. You could use your own programming language or have a different interface for it. I could imagine that would make fine tuning a little bit easier.
Maybe it's not possible. Maybe I just have zero clue and don't really know how fine tuning works, but I think something like that could make things a lot easier for most people, because I would imagine that for most AI application developers out there, if they don't have experience in Python, or in fine tuning, or even just in training models themselves, it's a little bit more difficult. But I could also imagine a world where there are easier ways of fine tuning through a web browser or something.
Maybe I'm wrong.
Maybe that does exist.
Maybe again, skill issue, but
Nicolay Gerold: Yeah, I think the bigger issue at the moment is that all the libraries focus on inference for the local stuff, because training is a very different type of workload: you tend to have batches and you're adjusting the parameter weights, and especially in Llamafile, since it's one file, the parameters are hard-coded and you would have to basically adjust them. I'm not sure how easy it would be to integrate all that logic into, for example, a Llamafile or something like that, but that's for smarter people than me to figure out.
Alex Garcia: Smarter than me as well. The stuff they're doing in Llamafile is absolutely insane. It goes way over my head, but yeah.
Nicolay Gerold: I tried reading into it.
It's crazy.
What do you think is overrated and what's underrated? I always like to ask that: what's an underappreciated and what's an overrated technique?
Alex Garcia: Yeah, I think underappreciated is definitely binary quantization. Looking at all the different quantization tools out there, all the different ways to make your embeddings smaller, binary quantization just seems so easy to me and works out pretty well, especially if your vectors have large dimensions, right? So if you have a thousand-dimensional vector and you do binary quantization, it is much easier to deal with. You can store it in a regular database column; a lot of the time you don't have to worry about dedicated storage like, for example, a SQLite Vec virtual table. You could just store it inside of a regular table and be just fine, because the actual size is just a couple hundred bytes. It's faster to search, it's much easier to generate, and it makes a lot more sense in my head than some of the other quantization methods. And the performance is pretty good. And I think better embeddings models can also produce better quantization results, right? If those embeddings models train on a binary quantization loss, then it just works out a little bit better for you.
For overrated, I don't know about overrated, but one technique that I haven't used as much has been Matryoshka embeddings. That is when you have an embeddings model that is specifically trained to generate embeddings that have different break points: you have a thousand-dimensional vector and you also have another break point, a Matryoshka embeddings target, for 500 dimensions or 250 dimensions. Which in practice means that once you embed something and you have that thousand-dimensional vector, you can cut it off at those different places. So you can cut it off and just keep the first 500 elements or the first 250 elements and store those, and the loss that you get from that isn't that bad, right? You could theoretically just store those first 500 or first 250 elements, save a lot of space, and use those for your comparisons.
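A hedged sketch of that truncation step: keep only the first N dimensions and re-normalize before comparing. This only behaves well if the model was actually trained with Matryoshka-style targets at that size:

```python
import numpy as np

def truncate_matryoshka(embedding: np.ndarray, dims: int = 256) -> np.ndarray:
    head = embedding[:dims]
    # re-normalize so cosine similarity / dot products stay well behaved
    return head / np.linalg.norm(head)
```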
That's great, and I love seeing it. But I think the space savings you get from that aren't that much. For example, if you just store the first 250 elements of a thousand-dimensional vector, you only get down to one fourth of the space, which is great, and you can maybe do some other stuff like scalar quantization to get that down to one eighth or something. But with binary quantization it's 1/32, right? 1/32 of the space, going from a full floating point vector to a binary vector. That's way more space savings, and the performance might be comparable if you're using a good embeddings model. Mixedbread AI claims that their embeddings model, which was specifically trained for binary quantization, holds up at 95 percent quality when you do binary quantization. So 95 percent quality when you chop it off and only keep one thirty... 1/32... how do you say that?
Nicolay Gerold: one
Alex Garcia: One thirty second, yeah.
One thirty-second of the space, for a bit vector compared to the full floating point vector. Whereas Matryoshka embeddings may not even be able to do that, and their quality is probably a little bit worse than 95 percent. That's the reason why I think binary quantization has a lot more room to grow, especially because it can be specifically and easily trained into an embeddings model, as compared to Matryoshka embeddings, where you may not get as much space savings and probably have similar levels of quality after you cut that up. So yeah, I think binary quantization is probably going to be the future: way easier to train, way easier to use in databases than Matryoshka embeddings. If I see some numbers, maybe I'll change my mind on that, but I typically reach for binary quantization first before I even consider any Matryoshka embeddings or anything like that.
Nicolay Gerold: Yeah, they also have limits in terms of fine tuning. It's way harder to fine tune something like Matryoshka embeddings than just regular embeddings, which you then basically quantize post hoc. And in fine tuning, you like to have the choice of what to do: you could do the quantization already during the fine tuning, but you can also apply it after the fact. And I think that's the beauty of it.
Alex Garcia: Yeah.
And also, with Matryoshka embeddings, I know Snowflake, with their Arctic Embed version 1.5 embeddings model, talked about how their initial version didn't have Matryoshka embeddings and they added Matryoshka embeddings in that 1.5 model. They mentioned that they only trained on one Matryoshka target, right? I think the full vectors were, I want to say, 1,024 dimensions, and they specifically trained for, I think, 256 for their Matryoshka embeddings. They didn't have multiple levels like some other models do; OpenAI trained on four or five different targets, whereas the Snowflake 1.5 one only trained on one. And I forget the reason why they did that, but they did have a good comparison of, oh, when you have too many targets, the quality isn't as good, something like that. They explained it more in their blog posts, why they only trained on one specific Matryoshka target. So I think there's still a little bit more to figure out in that space for Matryoshka embeddings. Whereas binary quantization, again, I think just works out of the box, and if you're fine tuning, I imagine it's easier to include that as a loss function when you're doing it. That's my hunch, at least.
Nicolay Gerold: What is something that
you actually would love to see built
that would make your life easier?
Alex Garcia: I think making other machine learning models easy to run. One nice thing about the LLM craze and local AI craze of the past few years is that it's now a lot easier to run those types of models, text generation and embeddings models, locally than it was before. For those types of models, in the 2018, 2019 era, you had to install PyTorch or a thousand dependencies and compile something, and it would take a long time and be quite difficult to do, to be honest, especially if you're not a Python developer. But now llama.cpp compiles to a single binary, and with GGUF files these embeddings models are just a single file that you have to pass around. So now they're much easier to run, and it doesn't require a GPU; it can work on all CPUs and all that stuff. So I think that's one nice thing about AI and LLMs that happened: it's now just much easier to run.
But I think that hasn't happened for other types of machine learning models. Vision models, which we talked about before, are still pretty hard to run locally; I hope they get better. I also think of other models, like some NLP models, that were difficult to run in the past and are still difficult to run, or at least don't work on all platforms like these other ones do. For example, GLiNER, a newer NLP model that does some transformer-based thing, I would like to see be able to run on different platforms, whether that's in the browser or as a C-based library that could be embedded into other applications, because I believe right now it requires Python, a pip install, and probably PyTorch and a bunch of other large dependencies. So I hope that other machine learning models, like NLP models and these vision models, will get easier to run. That's my one wish for the next few months and years.
Nicolay Gerold: Probably, yes. And if people want to follow along with you, hire you, or see what you have built, where can they go?
Alex Garcia: Yeah, so if you want to check out SQLite VEC, just look up sqlite dash vec, that's dash V-E-C; I imagine it'll be on Google. You could also go to github.com/asg017, which has a link to SQLite VEC and whatever else. I also have a website, alexgarcia.xyz, if you want to check that out; it has links to all my different socials and some of the other projects that I work on. If you want to get in contact, you can find my Twitter through there or send me an email or whatever.
Nicolay Gerold: So what can we take away? I think, first of all, the performance boundaries of SQLiteVec are very interesting. He mentioned that up to 500,000 vectors with 768 to 1,024 dimensions are very doable in under 500 milliseconds, which is basically the threshold for human perception. So that's really perfect. And that's a really large index if you consider you're running on the edge or on the customer's device, because nearly no single customer will have a data set or knowledge base that is as large. And this is for brute force search, so there isn't an ANN implementation here yet.
So this will become really interesting down the road as it becomes more efficient: how much can you actually do on the edge or on smaller devices? And the optimal use cases really are medium-scale applications that are local-first, or where privacy is an issue, and where sub-second search is still acceptable. But I would expect it's probably really focused on certain domains and on consumer devices, where you have some knowledge that you want to store on device because you don't want to share it, or because it's just faster.
So you could also come up with a strategy where you just cache a set of data on the customer's device and search on top of that. In the meantime, while you show the first few results, you query your own main database, and once you get the results back, you enrich the cached results with the results from your main database. I think this could be a really interesting pattern, because you would get away with a lot less infrastructure: you don't need as much distribution across the world if you have a global user base, and you don't have to worry as much about all the network delays you could encounter. Also, some general performance optimization tips: use binary quantization for really large data sets; if you're doing it in SQLite Vec, use metadata filtering so you can do subset searches to make it faster and more efficient; consider partitioning your data as well, by user or by some form of category; and do performance monitoring, meaning monitor the search latency across different vector sizes and different types of queries.
The complexity, I think, comes into play with the local deployment workflow. When you have that local database, you have to think much more about updates to the database, because you actually have to do the migrations on each of the customer devices. And for that, to be honest, I'd be out of my depth on what exactly has to happen when you, for example, have a schema migration in one of your tables and you have to push it to all the different devices.
You actually can do some really interesting experiments and start way simpler with search. If you start with a SQLite database on the customer's device, or just run it embedded on an endpoint (there are different solutions for that, obviously), it's a very cheap way of experimenting and also putting something in production, because the environment in which you run it, the SQLite database embedded on your user's device and the SQLite database on your own machine when you're developing, is the same environment, which often isn't a given in AI and ML. So you really can take it, experiment locally, and then put it into production, and it should behave mostly the same. And I think that's a really interesting pattern that allows you to start simple and add complexity as needed.
And once your application really hits it off, you can decide which database you should choose based on the query patterns you actually see in production. So you have a more natural progression: you can start simple, you can do a lot of the different types of searches, keyword search, hybrid, some re-ranking, and then you monitor the query patterns, and over time, as the load gets too big for SQLite, you move on and pick Elasticsearch, OpenSearch, Vespa, or Qdrant, depending on your use case.
We will continue next week with an episode on quantization. Alex talked a lot about it; we will be going especially into product quantization and binary quantization next week. So stay tuned for that, and otherwise I will see you next week.