Nicolay Gerold: SQLite powers
nearly every app on your phone.
It's a database in your messaging
app, in your calendar, but also in the
browser and countless other applications
which are running on your devices.
It's bundled into Python, and in Ruby or Node, for example, it's only one install away.
And its magic is in its simplicity.
It's one file to backup,
to move, or to version.
There is no complex deployment.
There are also no state
management headaches.
But up until recently, search in SQLite was restricted to keywords. Now we can build semantic search and hybrid search right in the local database.
And today we are talking to Alex
Garcia, who is the creator of SQLiteVec.
And we talk about how SQLiteVec works,
the optimal use cases, and how you
can optimize it for your use case.
And instead of basically having a massive central search database, each device can maintain its own search index, and the computation happens on the user's hardware. So results are near instantaneous with no network delay, and private data never leaves the device, which is probably the main motivator for why you want to do this. And this really shines for end-consumer applications, or for privacy-aware or privacy-sensitive applications like in medical or healthcare.
Let's do it
Alex Garcia: It's row-oriented storage, so it doesn't have good compression. Some of the other benefits that you get from column-oriented stuff, you don't get with SQLite. And also, SQLite is divvied up into pages of data, and that's how things are written and stored. Most of the time that doesn't matter for applications, but for SQLiteVec it actually does, because, and I guess we'll talk about this later, SQLiteVec stores all vectors in big chunks, giant blobs that are a couple of megabytes in size.
And the problem with that is since
page sizes in SQLite are like
four kilobytes each, that means on
disk, it's stored in like several
different pages all throughout.
It's not like in a
contiguous block of memory.
So there are small quirks like that that make some analytical stuff a little bit hard to do in SQLite. But the benefit is that if you're doing transactional stuff, say an application that has many writes, it handles that quite easily and it can do writes a lot faster. So that's usually the main difference versus DuckDB or LanceDB for that kind of stuff: when you're writing an application, SQLite is usually faster, especially for writes. But it is a little bit weird.
The documentation for SQLite and its file format storage is quite verbose. Every single byte in the header page and all the other pages that exist is documented, and there are a few tools in the command line interface where you can view every page and see what data is in there: B-trees, leaf nodes, all that stuff. So it's a little bit weird, but since it was built for transactional stuff, it makes a little bit more sense.
Yeah.
Nicolay Gerold: Yeah, so why did you decide to build SQLite Vec on SQLite specifically and not choose one of the other data formats? You could have done it in Wasm, or could have done it on LanceDB as well, on Lance or even Parquet.
Alex Garcia: Yeah.
I think I use SQLite a lot for other projects, like data analysis stuff and pipeline work, and even small little applications. So I was already using SQLite a lot for that kind of stuff, and building something for it made a lot of sense. And I think SQLite, compared to a lot of other database technologies out there, is fairly small and very lightweight. It's very easy to build extensions, whether that's scalar SQL functions or table functions or virtual tables and all that stuff. The documentation is quite nice, it's been around for a while, and it's quite stable. When I started SQLiteVec, it wasn't like, oh, I want to build vector search, let me find a database for it. It was much more like, I'm a SQLite guy, I use SQLite for a lot of projects, and I want an easier vector search thing that I don't have to install 10,000 dependencies to use. I just want something lightweight that works in my current workflow. So that's how I stumbled into SQLiteVec specifically.
I use SQLite most of the time just
because of how easy it is to use,
it's already bundled into Python.
For most other programming
languages like Node or Ruby,
it's just like one install away.
So it's fairly easy to set up as opposed to some other databases that are a little bit larger and require way more setup commands or a separate server or whatever. So that's most of the reason why I was already in the SQLite world beforehand.
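As a minimal sketch of how little setup that means in practice, here is roughly what loading the extension from Python looks like, assuming the sqlite-vec Python package and its load() helper behave as documented (and a Python build that allows extension loading):

```python
import sqlite3
import sqlite_vec  # pip install sqlite-vec

db = sqlite3.connect("my-app.db")
db.enable_load_extension(True)   # requires a Python build with extension loading enabled
sqlite_vec.load(db)              # registers the vec0 virtual table and vec_* functions
db.enable_load_extension(False)

print(db.execute("SELECT vec_version()").fetchone())
```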
Nicolay Gerold: Yeah, and what I love about SQLite as well, especially in the past few months, is that you can just have a local copy of the database and experiment, which just isn't there for most of the other databases.
Alex Garcia: Yeah.
And you can just, it's
a single file, right?
So you can like copy and paste
it if you want to make a backup.
That's not like transactionally
safe, but you can do it.
It's like completely fine.
So it's fairly easy to run a
quick experiment, destroy it if
you have to, back it up by just
like uploading the DB file to like
Google Drive or S3 or whatever.
So yeah, that's definitely another main
reason why like I usually reach for
SQLite just cause it's so easy to set up,
don't have to do much, just single file.
Very simple.
Nicolay Gerold: Yeah.
What else, what are some of the use cases or some of the fun stuff you've tried out with SQLiteVec, especially with local search and local AI on top of SQLite?
Alex Garcia: I think most of my initial projects with SQLiteVec have just been a better search engine, right? SQLite already has full text search, and I've already built a lot of applications in the past, whether it's personal stuff or for clients, that are just a straight-up search engine, keyword search, all that stuff. But of course it's just keyword search, right? There's no semantic search, no fuzzy matching, no deriving the semantic meaning out of it or whatever. So when I built SQLiteVec, that was the main thing I would do: if I had a pre-existing search engine, add vector search to it, and then you get all these cool new results from it. That's been my main focus, just because of how easy it is, and how well it integrates with the pre-existing SQLite full text search extension. So that's been like my main focus.
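One common way to combine the existing full text search results with vector results is reciprocal rank fusion. A minimal sketch, where the docs_fts FTS5 table, the vec_docs vec0 table, and the k = ? constraint syntax are assumptions for illustration rather than the speaker's exact setup:

```python
def hybrid_search(db, query_text, query_embedding, k=10, rrf_k=60):
    # keyword candidates from FTS5
    keyword = db.execute(
        "SELECT rowid FROM docs_fts WHERE docs_fts MATCH ? LIMIT ?",
        (query_text, k),
    ).fetchall()
    # semantic candidates from the sqlite-vec virtual table
    semantic = db.execute(
        """SELECT rowid FROM vec_docs
           WHERE embedding MATCH ? AND k = ?
           ORDER BY distance""",
        (query_embedding, k),
    ).fetchall()

    # reciprocal rank fusion: each list contributes 1 / (rrf_k + rank)
    scores = {}
    for results in (keyword, semantic):
        for rank, (rowid,) in enumerate(results, start=1):
            scores[rowid] = scores.get(rowid, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```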
I think there's other fun little
projects that I've done where it's
building a small little classifier,
like a text classifier on top of
the vector search and embeddings.
Again, very easy to use in SQLite
just because if your data is
already in SQLite, it's very easy.
And if it's not, it's very easy to
get your data into a SQLite database.
And once it's there, you embed those and build a little classifier on it: get the K nearest neighbors and then find the most common label among the closest 100 vectors or whatever. That's been my focus so far.
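A hedged sketch of that K-nearest-neighbor classifier idea, taking the majority label of the 100 closest vectors; the items and vec_items tables and the k = ? constraint are hypothetical names, not the speaker's actual schema:

```python
def knn_classify(db, query_embedding, k=100):
    row = db.execute(
        """SELECT items.label, COUNT(*) AS votes
           FROM (
             SELECT rowid FROM vec_items
             WHERE embedding MATCH ? AND k = ?
             ORDER BY distance
           ) AS knn
           JOIN items ON items.id = knn.rowid
           GROUP BY items.label
           ORDER BY votes DESC
           LIMIT 1""",
        (query_embedding, k),
    ).fetchone()
    return row[0] if row else None
```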
I think it's also because I have a background in data analysis and data engineering kinds of tasks. So that's usually what I go for: whenever I'm using SQLite, it's usually for some data analysis project, so building a search engine or building a small little classifier makes a lot of sense. I occasionally do, but my main focus isn't in building AI applications, right?
I've seen and heard other people talk about building a command line interface that uses SQLiteVec to store embeddings, right? Because of how easy it is, how it's right there on your computer. Or maybe it's a little web server that does some sort of embedding plus storing in SQLiteVec and stuff like that. So on the transactional application side, I've done small little projects here and there, but my main focus has always been: build a search engine, build a classifier, do some recommendation-related stuff with SQLiteVec. That's been mainly my focus there.
Nicolay Gerold: Yeah.
And where are those actually embedded? Are they running in web applications in the front end, are they running in the backend, embedded in one of the Cloud Runs or Cloud Functions?
Alex Garcia: So I think initially, when SQLiteVec first came out, since it was a developer tool, people would just run it on their laptops, right? So a local development server on your Mac or Windows machine or whatever, running in a command line interface, or a little web app that you have. I've heard of people running it on cloud applications as well, so they deployed it to AWS or Google Cloud or Cloud Run or any of those places.
And they're either storing or
doing like vector comparisons
with SQLiteVec on those platforms.
So it is totally possible to use SQLiteVec in desktop applications as well, and mobile applications.
I recently added support and I will be
adding better support for Android and iOS.
So you could theoretically run
SQLiteVec and store vectors in a
SQLite database on your Android phone
or on an iOS application and such.
It runs on a Raspberry Pi.
Actually I have a little device here.
I have a little this
is a Raspberry Pi Zero.
It doesn't need Wi Fi, but it has a
little like keyboard and e ink display.
It's called a Beepy or a Beatberry
and this can run SQLiteVec.
It's just a Raspberry Pi.
So you could like SSH into
it, Compile or download a copy
of SQLiteVec, run it on here.
And you can have those embeddings run here, so you could generate embeddings on here, generate embeddings on the server, or wherever else you run it. My main focus was: get SQLiteVec running everywhere, right? And it does run everywhere, including WebAssembly in the browser, mobile devices, all of that. And I've heard anecdotally of people running it on all these platforms as well, in either test applications or just trying things out, mostly because SQLiteVec is also fairly new; it just came out a few months ago.
Nicolay Gerold: yeah, I think it's
very interesting, especially with the
entire local first movement happening,
which got so many endorsements
from, like, all over the place,
which I would have never expected.
And I think the local AI part
is something I'm seeing more and
more, and we are implementing more
and more, and I think SQLite is a
very interesting choice for that.
I will be on the lookout for whether Apple integrates something like SQLiteVec in their Messages or Contacts.
Alex Garcia: Yeah, I know that Apple, with their Apple Intelligence stuff, on their slides they say "semantic index," which is an on-device vector index of some sort. They probably don't use SQLiteVec. I know that Microsoft has a Recall AI feature that would take screenshots of your computer every so often, and you could talk to your computer, like, oh, what was I doing on this website before? Which, first off, had a ton of privacy and security implications that were just absolutely terrible.
But one cool thing that came out of that was that Microsoft was actually storing their vectors inside of a SQLite database on your computer. So it wasn't syncing to the cloud or anything. They used their own custom implementation; they didn't use SQLiteVec just 'cause it didn't exist back then. But they did have some form of DiskANN support that was built on SQLite, where the nodes and edges were stored in a table. It was a complicated thing.
But I think that showed that the technology stands up, right? If these companies are building local AI stuff and they have all these resources and they still end up storing their vectors in SQLite, then this is a pretty good approach and it'll probably work in other places as well.
Yeah.
Nicolay Gerold: Yeah.
What I'm really interested in is how far you can push it, like what the limits are, especially of search speed but also of index size, especially when you get into the larger embeddings, like 1,000, 2,000, even 4,000 dimensions.
Alex Garcia: Yeah, I think first off, for the limits of SQLite in general, they say you can store terabytes of data within SQLite before it actually hits its theoretical limit. But I think the actual practical limits are much, much smaller. For SQLiteVec specifically: right now, SQLiteVec is just brute force search. It's just linear search; it compares every single vector inside. There's no approximate nearest neighbors index yet; it just does a brute force scan across everything. I mainly did that because, one, it was much easier.
one, it was much easier.
Two, SQLite is very the storage format for
SQLite is very specific, and some of the
ANN indexes that are out there just may
not work with how SQLiteVec stores data
within B trees and pages and all that.
So I didn't really want to choose
one where if it, I didn't really
want to choose an ANN index too
early and figure out it was wrong
and then have to change things later.
So I just held off on that at first, but
I think practically the the practical
the practical limits for SQLiteVec,
it's probably going to be like from
what I've seen, you could do tens of
thousands or hundreds of thousands
of vectors where the dimensions are
like, 768 or 1,024, like you could
probably like store tens of thousands or
hundreds of thousands of vectors before
it get, it takes too long to search.
I think what I tried was 768-dimensional vectors. I could probably store like half a million of those inside of a SQLite database, and the search would take around 500 milliseconds or so. In general, I'd like searches, especially most SQL queries, to take less than a hundred milliseconds to execute, because there's a lot of other overhead, and that's typically around where people can really tolerate it, or before they even notice that results are quote-unquote slow.
So I would say tens of thousands to hundreds of thousands. And if you have larger dimensions, like 2,000 or 3,000 dimensions, then probably even smaller; the hundred-thousand range would probably be a little bit too much, at least for this initial full-scan, linear, brute force search that SQLiteVec does currently. If you can tolerate it, SQLiteVec does have quantization support. So if you want to do binary quantization, or scalar quantization, or Matryoshka embeddings, where you cut them in half or truncate them to a smaller dimension, SQLiteVec does have support for that.
And with that, you could probably add a little bit more, right? With binary quantization, it's 32 times smaller, so you could probably do a million binary vectors in SQLiteVec before you notice some performance issues there.
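A hedged sketch of what that binary quantization path might look like, assuming the vec_quantize_binary() function and the bit[] column type behave as documented; the table name and the stand-in vector are made up for illustration:

```python
import sqlite3
import struct
import sqlite_vec

db = sqlite3.connect(":memory:")
db.enable_load_extension(True)
sqlite_vec.load(db)

# a bit-vector column: 768 bits instead of 768 float32s (32x smaller)
db.execute("CREATE VIRTUAL TABLE vec_bits USING vec0(embedding bit[768])")

vector = struct.pack("768f", *([0.1] * 768))  # stand-in float32 vector
db.execute(
    "INSERT INTO vec_bits(rowid, embedding) VALUES (?, vec_quantize_binary(?))",
    (1, vector),
)

# KNN over the quantized vectors; the query is quantized the same way
rows = db.execute(
    """SELECT rowid, distance FROM vec_bits
       WHERE embedding MATCH vec_quantize_binary(?) AND k = 10
       ORDER BY distance""",
    (vector,),
).fetchall()
```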
So that's typically around where it is currently. I want to get it to a better place. I think with a proper ANN index, which will be a part of SQLiteVec in the near future, I can imagine it doing a million, maybe a few million vectors before you really hit limits where SQLite and SQLiteVec wouldn't make sense.
I think another thing, it
would be metadata filtering and
partitions, which will also be
included in SQLiteVec very soon.
I think that could help
it scale even more, right?
Because even if you have 10 million vectors, you might only be searching a subset of those. Say you have a thousand users, and all those users have a thousand embeddings each. That's a million vectors, which might be too much to brute force search. But if you only search for one user at a time, you only search through their thousand embeddings. Searches can be a lot faster and a lot more tolerable. And I'll be including features that support that, including partitioning data and all that stuff, and metadata filters, relatively soon in SQLiteVec as well.
So all that to say: currently, tens of thousands, maybe hundreds of thousands of vectors. If you do binary quantization, maybe a million. But with some features that are coming down the line for SQLiteVec, I imagine it could support low millions, maybe 10 million vectors.
Nicolay Gerold: Shouldn't you already be able to do metadata filtering by just filtering down on another column
Alex Garcia: Yeah.
Nicolay Gerold: and then
Alex Garcia: Yeah, you could. So there are two ways of storing vectors with SQLiteVec. One way is a manual way where you just store a JSON array of text or a blob of vector data inside of a regular column in a table, and that works just fine. There are SQL scalar functions where you can manually compute distances and then just do a SELECT ... ORDER BY distance LIMIT 10, and that's a KNN search. And that works just fine.
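A minimal sketch of that manual approach, with vectors in a regular column and a brute-force KNN via a scalar distance function; the documents table and its columns are hypothetical, and it assumes sqlite-vec's vec_distance_cosine() as documented, with the connection set up as in the earlier loading sketch:

```python
import struct

query = struct.pack("768f", *query_embedding)  # float32 blob, same format as stored

rows = db.execute(
    """SELECT id, title,
              vec_distance_cosine(embedding, ?) AS distance
       FROM documents
       WHERE published_at > '2024-01-01'   -- ordinary SQL filters still apply
       ORDER BY distance
       LIMIT 10""",
    (query,),
).fetchall()
```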
There are some performance implications with that. If you have very large vectors, say a 2,000-dimensional vector, that's going to be about 8 kilobytes of storage per vector, right? And if you're storing that alongside your other data in a regular SQLite table, where that data is stored in a row-oriented format, there could be performance implications, where the disk head has to scan through a lot more data because you're storing these large blobs within the table. So there are some performance implications if you're doing that.
So another way of storing vectors within SQLite is using a virtual table. We have a virtual table called vec0, and it's very similar to the virtual tables found in the SQLite full text search extension and the R-tree extension. It's CREATE VIRTUAL TABLE, and then there are some configuration options you can add. What that virtual table does is store vectors within your SQLite database inside of shadow tables, but it chunks the vectors into large blobs, like a big chunk of a thousand vectors each. And that makes searches a lot faster. It stores data completely separately from your other data, so it's a separate table and you have to do a join, which can be a little bit awkward at times, but it works just fine.
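A hedged sketch of that vec0 virtual table approach and the join back to the source data; the table names and the doc_id/embedding variables are placeholders, and the float[384] declaration and the k = 10 KNN constraint follow the sqlite-vec docs as I understand them:

```python
import struct

db.execute("CREATE VIRTUAL TABLE vec_docs USING vec0(embedding float[384])")

# rowids line up with the ids of a normal `documents` table
db.execute(
    "INSERT INTO vec_docs(rowid, embedding) VALUES (?, ?)",
    (doc_id, struct.pack("384f", *embedding)),
)

# KNN search in the virtual table, then join back for the actual content
rows = db.execute(
    """SELECT documents.id, documents.title, knn.distance
       FROM (
         SELECT rowid, distance FROM vec_docs
         WHERE embedding MATCH ? AND k = 10
         ORDER BY distance
       ) AS knn
       JOIN documents ON documents.id = knn.rowid""",
    (struct.pack("384f", *query_embedding),),
).fetchall()
```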
And with that, searches are much faster, and there are some other performance gains: because we store things in a contiguous block, we can read them into memory a lot faster, and we can do things like SIMD acceleration. So there are a lot of benefits to doing that.
But the problem with that virtual table is that there's currently no metadata filtering there. That's something I have to implement myself, and something I will implement. When metadata filtering is completed in SQLiteVec, on the vec0 virtual tables it will just be regular columns, right? So you can declare extra columns alongside your vector index, and you can do metadata filtering with WHERE clauses, just like WHERE submitted_at is between these dates, or WHERE this label equals red, or whatever. So that will come; it's just not a part of the virtual table implementation yet. But if you're doing things manually and storing the vector alongside regular columns, you could totally do that yourself: you can do normal SQL queries on top of that and be just fine.
Another thing is you could also do pre-filtering with both the manual implementation and the virtual table implementation, where you do another SQL query that just returns a list of IDs of items that you want to search. So let's say, for example, I get all the documents that this user has access to; then you can pass that allow-list into the query for the vec0 virtual tables, and it would only consider those when doing comparisons. So that is one way of doing it, but it could be a little bit slow, not as performant as some of the other ways that will be coming down the pipeline in a little bit.
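A hedged sketch of that allow-list pre-filter; the permissions table is hypothetical, and the rowid IN (...) constraint on the KNN query is my reading of the sqlite-vec docs, so the exact syntax may differ in your version:

```python
allowed = [row[0] for row in db.execute(
    "SELECT document_id FROM permissions WHERE user_id = ?", (user_id,)
)]
placeholders = ",".join("?" * len(allowed))  # assumes a non-empty allow-list

rows = db.execute(
    f"""SELECT rowid, distance FROM vec_docs
        WHERE embedding MATCH ?
          AND k = 10
          AND rowid IN ({placeholders})
        ORDER BY distance""",
    (query_blob, *allowed),
).fetchall()
```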
Nicolay Gerold: Yeah, that's very interesting. My mind instantly jumped to re-rankers as well, that you could actually support re-ranking, especially when you have local search applications or some form of recommendation system in the front end, which could make it way faster. Imagine an e-commerce site where the user is already navigating the page: you have some filters available while the user is navigating, and you have the search query. And I think this could be an interesting addition to SQLiteVec as well; adding some re-ranking capabilities could be very cool.
Alex Garcia: Real quick, do you mean a re-ranking model, or just re-ranking a binary quantized bit vector with the full-size vector?
Nicolay Gerold: No, re
Alex Garcia: Oh, a re-ranking model?
Okay, yeah.
Okay.
Nicolay Gerold: In the end, the re-ranker often integrates multiple different factors, so it's often combined with, for example, a score for the recency of the data or a score for past interactions. Popularity, for example, is a common score that you add to the different items, and you can basically add business logic to the re-ranking as well, so that you don't just have the query-document or query-item re-ranking score, but you add more of the business logic scores on top of that. So you can optimize for certain
Alex Garcia: Yeah.
I wasn't aware of that. I always thought that re-ranking models just re-rank based off the content itself, and I always thought that value was limited. But if you can include other metadata, like popularity and stuff, that's pretty cool. It sounds like it could be a main part of SQLiteVec. That'd be interesting.
Nicolay Gerold: Yeah, and there are two ways of implementing it in the end. One is you actually train it into the re-ranker by using the popular items in your training data set. But in the end, that biases it towards very popular items all the time, which is something you often don't want. Or you can basically just use it as a weighted score with a weighting function, which is often a little bit more controllable.
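A hedged sketch of that weighted-score approach: blend the vector similarity with business-logic signals such as recency and popularity. The weights, decay constants, and field names here are made up for illustration and would normally be tuned against an evaluation set:

```python
import math
import time

def rerank(candidates, w_sim=0.6, w_recency=0.2, w_popularity=0.2):
    now = time.time()

    def score(c):
        similarity = 1.0 - c["distance"]            # smaller distance = more similar
        age_days = (now - c["published_at"]) / 86400
        recency = math.exp(-age_days / 30)          # decay on a roughly 30-day scale
        popularity = math.log1p(c["clicks"]) / 10   # dampen heavy-tailed counts
        return w_sim * similarity + w_recency * recency + w_popularity * popularity

    return sorted(candidates, key=score, reverse=True)
```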
Alex Garcia: That's so cool.
Yeah.
Nicolay Gerold: Nice.
And I think for me, when I'm working with embeddings, I'm often fine tuning models and keeping them up to date, because you have some data drift in practice.
How could I approach that with SQLite Vec?
And especially when I'm assuming I
have it deployed on multiple different
devices or multiple different customers.
How can I actually keep the
embeddings basically up to date
with the most recent model?
Alex Garcia: That is a good question.
So it's a different embeddings model that you fine tune and train. Like, you insert some vectors and embed some stuff on March 1st, and then on March 14th you train a new model, and you want to go back and update the previous embeddings you had with the new model. Okay. Yeah, SQLiteVec is just SQLite at the end of the day, just pure SQL. So you could go back and update a table or a virtual table that has vectors in it with the new values that you have.
You could also use triggers. So for example, say you're embedding comments from a YouTube video: you have a column called comment, which is the actual text comment someone left on a video, and you're doing some vector search on top of that, so you're storing the embeddings of those comments in a separate virtual table with SQLiteVec, or just in another column as a JSON array or a blob representation of a vector. You could always use triggers, so whenever a comment is updated, a trigger goes and updates that virtual table, or updates that other column, with a new embedding representation of the new comment.
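A hedged sketch of that trigger idea. It assumes an embed() SQL function is available on the connection, for example one registered from Python with create_function (or provided by an extension like sqlite-lembed); the comments and vec_comments tables are hypothetical:

```python
# embed_text() stands in for whatever call returns a float32 blob for a string
db.create_function("embed", 1, lambda text: embed_text(text))

db.executescript("""
CREATE TRIGGER IF NOT EXISTS comments_embedding_refresh
AFTER UPDATE OF comment ON comments
BEGIN
  -- delete-and-insert keeps the vec0 shadow tables in sync with the new text
  DELETE FROM vec_comments WHERE rowid = NEW.id;
  INSERT INTO vec_comments(rowid, embedding)
    VALUES (NEW.id, embed(NEW.comment));
END;
""")
```

Note that a function registered from Python like this is only visible on connections that register it, so every writer would need to do so.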
And if you're syncing with a newly fine-tuned model, or if you change the entire embeddings model in general, I think an update like that is probably what's needed.
And as long as the dimensions match up
and all that, it should work as expected.
Don't know if I have a
good answer to that one.
Yeah.
Nicolay Gerold: Yeah, I think that's one of the most challenging parts, especially in the local-first era: the syncing between different devices, between the user and the backend, and how to actually make that possible without overwrites if you have multiple users working on the same data, which is a very
Alex Garcia: Yeah, there are a few tools and features that people have built that wrap SQLite to do some synchronization work, where you have multiple people with their own version of a SQLite database, and it eventually all gets synced back to a main, master copy. I don't know what the status of those tools is. Some of them can be quite complicated to build and they may not have a lot of funding, so I don't know exactly if there are any good options out there. And I guess I don't know how well they would work with SQLiteVec in general, because SQLiteVec uses virtual tables. You might be able to sync the underlying shadow tables where the data is actually stored, but I've just never tried that myself. I think there are a few good backup options for SQLite where it will continuously replicate what you're working on into S3 or somewhere else; there are replication tools out there specifically for SQLite.
SQLite does have a session extension. It's not necessarily like forking a SQLite database, but if you have your own copy of a SQLite database and you make edits to it, and then you want to sync those edits with someone else and have them merge together, and raise an error if there's a conflict or whatever, there is a session extension that does part of that. It's not very well used; I don't know how often it's used in real-world environments. But there are some ways of doing that and some ways of synchronizing. Since SQLiteVec uses the virtual table implementation, though, it might be a little bit hard to work it into some of those tools.
Nicolay Gerold: Yeah, I would really
love to try to build like a local only
recommendation system, which basically
collects the user data only in the
front end, adds it to a vector database,
gets the items, embeds them, and then
basically does the recommendation.
This would be a fun side project
Alex Garcia: Yeah, I usually get around synchronization issues by just not synchronizing and having it completely local, where you don't need to share with other people or do anything like that. Just try to make it more private if possible. That also sidesteps the entire issue. Maybe not the best approach, but yeah.
Nicolay Gerold: Yeah.
What I would love to hear: what are your go-to models when you're building the local stuff? Are you using custom models as well, or do you rather tend to go with something like Cohere's or OpenAI's?
Alex Garcia: I try to use open source models as much as I can, and I avoid OpenAI and Cohere, mostly because being able to run it myself, I find, can be a lot easier and a little bit more future proof, and you don't have to pay for it, which is great for cost and all. When I do choose an open source model, I typically choose a sentence-transformers model, mostly because those can be translated into a GGUF format for llama.cpp, which is usually the inference engine that I use for embeddings models, whether that's through Llamafile, which is a Mozilla AI project where you can run local models, including embeddings models, and generate embeddings through it. That takes GGUF model files, and I like those because they're much easier to run; it's just a single file, like SQLite. The inference engine itself, llama.cpp, is just pure C. It's very easy to compile and use in other projects. I also have another SQLite extension that's a sister project to SQLite Vec, called sqlite-lembed, which uses llama.cpp to generate text embeddings.
That's usually what I use as a go-to, because I'm already in SQLite land. A GGUF model is fine, and sentence-transformers does pretty well; you can convert a sentence-transformers model into GGUF format quite easily. I typically choose the really small one. I forget the name of it because it's a bunch of numbers and letters, it's like all-MiniLM-L6-v2 or something. That one's 30 megabytes and it quantizes very easily. The only problem with it is that the context size is fairly small, only a couple hundred text tokens, I believe. But I like it because it's very small, quick, and easy. And it was trained a while ago, so it doesn't know a lot of recent stuff.
So if I do want something that's a little bit more recent, that knows things about COVID-19 or anything like that, then, I have found, Salesforce has a few embeddings models recently, I think it's called Arctic Embed. No, I'm sorry, that's Snowflake; Snowflake has a few embeddings models called Arctic Embed. The recent version 1.5 is fairly nice. It does have support for Matryoshka embeddings and a few other nice features that come with it. It's a little bit larger and a little bit more difficult to run, just 'cause instead of a 30 megabyte file like the other one, this one's probably in the couple-hundred-megabyte range. But it's trained on more recent data. For example, I was working on a little project that does news article headlines: scrape the last couple thousand headlines from NBC News, an American news outlet, and do searches on top of those headlines. Those are fairly recent topics, and some older embeddings models may not know about them, but this Arctic one seems to know a lot about them, so it does pretty well.
I also try out Nomic, which has a few open source ones. Mixedbread has a really good embeddings model that does binary quantization very well. So whenever I can, I just try an open source model, because they're easy to run, they're free, and you don't have to worry too much about calling out to another API or whatever. I think another nice thing about using a local model is that it might run faster, not so much because it's running on your own hardware, but because there's no network delay, and you don't have to worry about API keys or sending a REST API request or whatever. So keeping it local makes that part a little bit easier, and it's less likely for there to be bugs or crashes or whatever else.
Nicolay Gerold: Yeah.
How have you experienced the system demands when you're building something, especially user facing, and when you have multiple different models running: one file for your transformer model for generating text, an LLM, and one embedding model, maybe even some stuff on top of it? Have you seen any interesting strategies for swapping different models in and out of memory, or anything interesting in how that is handled?
Alex Garcia: Not that I've seen. Typically the most I have running is, like you said, maybe one LLM if it's doing a RAG application, so one LLM-style model, maybe a Llama or something else that does text generation, and another model that's strictly for embeddings, whether that's Snowflake Arctic Embed or whatever. Typically the embedding models are fairly easy to run. They're probably less than a gigabyte, often even smaller than that, so they don't take up a lot of memory or RAM, and you don't have to worry too much about it. If I was running multiple models at the same time?
Yeah, I don't know if there are a lot of good strategies for swapping models in and out of memory and all that. I would just shut down the instance or have a separate server. One nice thing about doing local stuff is that there are a lot of other tools out there that make it easier to talk with other devices on your network. I develop on a 2019 MacBook, the old Intel processor. It's fairly old now, it doesn't have a lot of RAM, and I can't run Llama on it or anything. But I do have a newer Mac mini that sits on my desk all the time, and I could always SSH into that and run another model, right? Or I can use Tailscale to access my Mac mini if I'm out at a coffee shop or whatever. That could run Llama, and I could run the embeddings model on my computer, and the network delay may not be too bad, right? So if you're running stuff on device, you probably have multiple devices on your network; maybe you could use some of them to do a hybrid mixture of things. Or of course you could always pay someone like OpenAI or Anthropic to host a larger LLM for you if you wanted to do that. So there are ways of either using multiple devices or just using another server if you have multiple models running. But I don't know if there's a good option for swapping models in memory and stuff. Yeah.
Nicolay Gerold: Yeah, on the topic from before, the current models: I still have a strong preference for using models from before 2021, especially embedding models, because I'm still uncertain about the impact of all the AI-generated content on the web, which is now basically in all the different models. So I really liked all the BGE models, which are fairly good if you actually fine tune them as well, which is how you get the relevant information or knowledge into the model.
Alex Garcia: Yeah.
I think fine tuning is probably
required for a lot of applications.
I think when you're developing or doing prototypes, you don't really need that stuff all the time. But as soon as you have an actual use case for it and you have an evaluation data set and you need to get something out and done and actually in front of customers or users, then fine tuning is probably going to be required for whatever you do.
Yeah.
Nicolay Gerold: Yeah.
Anything beyond that in your stack? You mentioned Llamafile, and probably Transformers.js for all that sentence-transformers stuff when you're in the front end. What else is in your go-to tech stack?
Alex Garcia: Yeah.
There are a few developer-friendly tools that I use. Simon Willison has an LLM command line interface where you can interface with multiple different hosted LLMs, whether that's OpenAI or Anthropic or whatever, but also local models and local approaches, whether that's MLX or llama.cpp or PyTorch or whatever. I think that command line interface is a good way to try out different models, just to run things on the command line. For building applications themselves, I typically default to llama.cpp, mostly because it's easy to run on device and easy to compile and use, compared to things like PyTorch or TensorFlow, which can be, at least in my past experience, very difficult to compile or run on many platforms. So having something that's one binary and one file for the model itself with llama.cpp, that's usually my go-to.
Transformers.js is definitely a big one for working inside the browser. There are a lot of different embeddings models that run in the browser through Transformers.js that I use whenever I want to have a demo of SQLite running in the browser, or just semantic search in the browser. That's usually my go-to.
Some other projects: Llamafile is a big one. Llamafile also has Whisperfile as a new feature, where you can run Whisper as a single binary that runs on multiple operating systems, which makes my life a lot easier when working with audio. Yeah, that's all I can think of right now.
Nicolay Gerold: Yeah, I saw Whisperfile; it's really interesting. Have you played around yet with VLMs as well, like vision language models operating on images and other media?
Alex Garcia: Unfortunately not. I think the reason why is because llama.cpp doesn't have good support for vision models. They had initial support for one or two of them in the very beginning, but those slowly got deprecated or bit-rotted to the point where they don't really quite work anymore.
Which is quite disappointing, because in the last month or so there have been a lot of really cool vision models that have come out. To use them, you have to use PyTorch or a pre-configured Colab notebook, and I've avoided that just 'cause I've wanted to run things locally. But some of them look really cool. I think vision models are probably going to be the next big thing for the next few months, whether that's models specifically trained for OCR or handwritten text or image recognition and all that.
There are a few CLIP embeddings models that I've played with, and SQLite Vec of course works with them, because at the end of the day they're just embeddings you could store inside of any vector database. But yeah, I think there's going to be a lot more movement in that area, a lot more models, which I appreciate and love. I just hope they get easier to run, and easier to run on other platforms, because I think even Transformers.js may not have good support for them yet, just because inference on vision models is much different than on embeddings and text models, and it may take some time for them to really support that. But yeah, I just hope they get easier to run soon.
Nicolay Gerold: I would also expect that vision language models in general have to get bigger and bigger first before they can get smaller and better again.
Alex Garcia: Yeah.
Same thing as with text models, right? They got bigger and bigger, and now I think they've figured out some good strategies to slim them down, or better training data. So now things like GPT-4o mini are much smaller and faster to run, or some of the new Llama models with smaller parameter counts are better than the larger-parameter ones from even two years ago. So I think that little hump is coming soon: the hump that happened with text models a few years ago is probably going to happen soon with vision models as well,
Nicolay Gerold: yeah.
Have you actually benchmarked SQLite and SQLiteVec against the other, more common tools for local-first vector databases or local-first search, like, for example, Chroma, but also, if you're running on the server, Faiss, which tends to be my go-to?
Alex Garcia: I have run benchmarks. I don't know if I completely trust them, because benchmarks can be a very sensitive topic for a lot of people and a lot of other projects. But I'm confident in saying that SQLiteVec, with a few parameters tuned, is much faster at brute force vector search than a lot of other client libraries, with the exception of Faiss. Faiss is just so much faster than a lot of other ones; if you're doing brute force, Faiss is much faster than SQLiteVec. There might be one or two other C vector search libraries that are faster than SQLiteVec, at least for brute force, all in memory, all that stuff. Some other libraries out there aren't optimized for linear scans or brute force scans, because they care about ANN index speed and recall and all that. So in that case SQLiteVec isn't faster than any of those, because SQLiteVec is brute force, and any ANN index is going to be much faster than a brute force search. But I would say that in general, SQLiteVec is much smaller, easier to run, and easier to install. And also, inserting data into SQLiteVec is fairly straightforward, because again, it just stores vectors one after the other; it doesn't have to navigate through an HNSW graph or a DiskANN graph to insert stuff. So usually insertion is much faster, and usually KNN search queries are faster in SQLiteVec than in some of those other tools. Again, for brute force searches.
Nicolay Gerold: And also, I think an underappreciated aspect is that you have an actual data model, which often makes the iterations way faster, because you have to think about your data first and how it looks, and you can also easily run all the different filters, pre-filter and post-filter, with the additional columns you have in SQLite.
Alex Garcia: Yeah.
And also, one nice thing about SQLite and SQLite Vec is that it just does fewer things, so there are fewer things to slow it down. For example, the actual storing of data and writing to disk and all that, SQLite Vec just uses SQLite for, right? And SQLite has been optimized for very fast disk writes and reads for the past 20 years. Whereas some of these other vector databases or local-first vector tools out there had to implement everything from scratch: they wrote their own disk access, they have their own transactional stuff, they had to write their own database, right? SQLiteVec just uses a pre-existing database that's optimized for speed and quick writes and all that stuff. So that's one nice thing about SQLiteVec: because it's on SQLite, a lot of things just run much faster, because it's a proven technology. And we don't have to spend a lot of time deserializing or thinking about how we're going to store things on disk, mostly because SQLite already figured most of that out for us, if that makes sense.
Nicolay Gerold: Yeah.
And what would you say is actually missing from the space?
Alex Garcia: I think on the vector search and vector storage side, not much. In the next few years there might be some better ways to store vectors, or a better ANN index that someone comes up with that's faster or has better recall or whatever. So on that side I don't see much happening; again, there might be some new quantization techniques or new ways of storing query vectors. But I think the real innovation in this space is probably going to come on the embeddings models side. If those embeddings models become smaller, faster, easier to fine tune, and easier to use, that's going to have a way bigger impact on vector search, semantic search engines, and recommendation systems than anything the vector database can do itself. For example, if it's easier to run an embeddings model that is less than 100 megabytes, can run on any platform, is very up to date, and knows information from just a few months ago, I think that's going to have a way bigger impact on AI applications than anything a vector database can do, right? Because with a vector database, sure, a KNN search may get a little bit faster, and the recall rate will get better if you're using an ANN index or whatever. But the real innovation in this space is probably going to come from better, faster models that are easier to fine tune, or to use, or to restrict to your own use case. I think that's where most of the innovation is probably going to come from. Yeah.
Nicolay Gerold: Yeah.
I think, on the fine tuning part: I think fine tuning an embeddings model isn't that difficult. It's more that creating the data set is very hard, and I don't think that will go away.
Alex Garcia: And I think, on my side, as someone who's never done fine tuning and who has found all the guides a little bit intimidating, that's why I think, oh, it should be easier to fine tune. Whereas someone like you who has done fine tuning in the past knows how easy it is. So it just might be a skill issue on my side, where I've never tried actual fine tuning or anything.
Nicolay Gerold:
Everything's a skill issue,
Alex Garcia: And also, I think it would be cool to do fine tuning in the browser, right? Or fine tuning in different languages. 'Cause I imagine, and maybe I'm wrong on this, that fine tuning is usually, what, a Colab notebook that you spin up, that's only in Python, that does all this stuff. But if you could theoretically point a fine tuning command line interface at, or upload to a web interface, just the data that you have, with maybe some instructions, then we wouldn't have to use a Colab notebook or use Python. You could use your own programming language or have a different interface for it. I could imagine that would make fine tuning a little bit easier.
Maybe it's not possible. Maybe I just have zero clue and don't really know how fine tuning works, but I think something like that could make things a lot easier for most people, because I would imagine that for most AI application developers out there, if they don't have experience in Python, or in fine tuning, or even just in training models themselves, it's a little bit more difficult. But I could also imagine a world where there are easier ways of fine tuning through a web browser or something.
Maybe I'm wrong.
Maybe that does exist.
Maybe again, skill issue, but
Nicolay Gerold: Yeah, I think the bigger issue at the moment is that all the libraries focus on inference for the local stuff, because training is a very different type of workload: you tend to have batches and you're adjusting the parameter weights, and especially in Llamafile, since it's one file, the parameters are hard-coded and you would have to basically adjust them. I'm not sure how easy it would be to integrate all that logic into, for example, a Llamafile or something like that, but that's for smarter people than me to figure out.
Alex Garcia: Smarter than me as well. The stuff they're doing in Llamafile is absolutely insane. It goes way over my head, but yeah.
Nicolay Gerold: I tried reading into it.
It's crazy.
What do you think is overrated and what's underrated? I always like to ask that: what's an underappreciated and what's an overrated technique?
Alex Garcia: Yeah, I think underappreciated is definitely binary quantization. Looking at all the different quantization tools out there, all the different ways to make your embeddings smaller, binary quantization just seems so easy to me and works out pretty well, especially if your vectors have large dimensions, right? So if you have a thousand-dimensional vector and you do binary quantization, it is much easier to deal with. You can store it in a regular database column; a lot of the time you don't have to worry about dedicated storage like, for example, a SQLite Vec virtual table. You could just store it inside of a regular table and be just fine, because the actual size is just a couple hundred bytes. It's faster to search, it's much easier to generate, and it makes a lot more sense in my head than some of the other quantization methods. And the performance is pretty good. And I think better embeddings models can also produce better quantization results, right? If those embeddings models train on a binary quantization loss, then it just works out a little bit better for you.
For overrated, I don't know about overrated, but one technique that I haven't used as much has been Matryoshka embeddings. That is when you have an embeddings model that is specifically trained to generate embeddings that have different break points: you have a thousand-dimensional vector and you also have another break point, a Matryoshka embeddings target, for 500 dimensions or 250 dimensions. Which in practice means that once you embed something and you have that thousand-dimensional vector, you can cut it off at those different places. So you can cut it off and just keep the first 500 elements or the first 250 elements and store those, and the loss that you get from that isn't that bad, right? You could theoretically just store those first 500 or first 250 elements, save a lot of space, and use those for your comparisons.
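A hedged sketch of that truncation step: keep only the first N dimensions and re-normalize before comparing. This only behaves well if the model was actually trained with Matryoshka-style targets at that size:

```python
import numpy as np

def truncate_matryoshka(embedding: np.ndarray, dims: int = 256) -> np.ndarray:
    head = embedding[:dims]
    # re-normalize so cosine similarity / dot products stay well behaved
    return head / np.linalg.norm(head)
```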
That's great, and I love seeing it. But I think the space savings you get from that aren't that much. For example, if you just store the first 250 elements of a thousand-dimensional vector, you only get down to one fourth of the space, which is great, and you can maybe do some other stuff like scalar quantization to get that down to one eighth or something. But with binary quantization it's 1/32, right? 1/32 of the space, going from a full floating point vector to a binary vector. That's way more space savings, and the performance might be comparable if you're using a good embeddings model. Mixedbread AI claims that their embeddings model, which was specifically trained for binary quantization, holds up at 95 percent quality when you do binary quantization. So 95 percent quality when you chop it off and only keep one thirty... 1/32... how do you say that?
Nicolay Gerold: one
Alex Garcia: One thirty second, yeah.
One thirty-second of the space, for a bit vector compared to the full floating point vector. Whereas Matryoshka embeddings may not even be able to do that, and their quality is probably a little bit worse than 95 percent. That's the reason why I think binary quantization has a lot more room to grow, especially because it can be specifically and easily trained into an embeddings model, as compared to Matryoshka embeddings, where you may not get as much space savings and probably have similar levels of quality after you cut that up. So yeah, I think binary quantization is probably going to be the future: way easier to train, way easier to use in databases than Matryoshka embeddings. If I see some numbers, maybe I'll change my mind on that, but I typically reach for binary quantization first before I even consider any Matryoshka embeddings or anything like that.
Nicolay Gerold: Yeah, they also have limits in terms of fine tuning. It's way harder to fine tune something like Matryoshka embeddings than just regular embeddings, which you then basically quantize post hoc. And in fine tuning, you like to have the choice of what to do: you could do the quantization already during the fine tuning, but you can also apply it after the fact. And I think that's the beauty of it.
Alex Garcia: Yeah.
And also, with Matryoshka embeddings, I know Snowflake, with their Arctic Embed version 1.5 embeddings model, talked about how their initial version didn't have Matryoshka embeddings and they added Matryoshka embeddings in that 1.5 model. They mentioned that they only trained on one Matryoshka target, right? I think the full vectors were, I want to say, 1,024 dimensions, and they specifically trained for, I think, 256 for their Matryoshka embeddings. They didn't have multiple levels like some other models do; OpenAI trained on four or five different targets, whereas the Snowflake 1.5 one only trained on one. And I forget the reason why they did that, but they did have a good comparison of, oh, when you have too many targets, the quality isn't as good, something like that. They explained it more in their blog posts, why they only trained on one specific Matryoshka target. So I think there's still a little bit more to figure out in that space for Matryoshka embeddings. Whereas binary quantization, again, I think just works out of the box, and if you're fine tuning, I imagine it's easier to include that as a loss function when you're doing it. That's my hunch, at least.
Nicolay Gerold: What is something that
you actually would love to see built
that would make your life easier?
Alex Garcia: I think making other machine learning models easy to run. One nice thing about the LLM craze and local AI craze of the past few years is that it's now a lot easier to run those types of models, text generation and embeddings models, locally than it was before. For those types of models, in the 2018, 2019 era, you had to install PyTorch or a thousand dependencies and compile something, and it would take a long time and be quite difficult to do, to be honest, especially if you're not a Python developer. But now llama.cpp compiles to a single binary, and with GGUF files these embeddings models are just a single file that you have to pass around. So now they're much easier to run, and it doesn't require a GPU; it can work on all CPUs and all that stuff. So I think that's one nice thing about AI and LLMs that happened: it's now just much easier to run.
But I think that hasn't happened for other types of machine learning models. Vision models, which we talked about before, are still pretty hard to run locally; I hope they get better. I also think of other models, like some NLP models, that were difficult to run in the past and are still difficult to run, or at least don't work on all platforms like these other ones do. For example, GLiNER, a newer NLP model that does some transformer-based thing, I would like to see be able to run on different platforms, whether that's in the browser or as a C-based library that could be embedded into other applications, because I believe right now it requires Python, a pip install, and probably PyTorch and a bunch of other large dependencies. So I hope that other machine learning models, like NLP models and these vision models, will get easier to run. That's my one wish for the next few months and years.
Nicolay Gerold: Probably, yes. And if people want to follow along with you, hire you, or see what you have built, where can they go?
Alex Garcia: Yeah, so if you want to check out SQLite VEC, just look up sqlite dash vec, that's dash V-E-C; I imagine it'll be on Google. You could also go to github.com/asg017, which has a link to SQLite VEC and whatever else. I also have a website, alexgarcia.xyz, if you want to check that out; it has links to all my different socials and some of the other projects that I work on. If you want to get in contact, you can find my Twitter through there or send me an email or whatever.
Nicolay Gerold: So what can we take away? I think, first of all, the performance boundaries of SQLiteVec are very interesting. He mentioned that up to 500,000 vectors with 768 to 1,024 dimensions are very doable in under 500 milliseconds, which is basically the threshold for human perception. So that's really perfect. And that's a really large index if you consider you're running on the edge or on the customer's device, because nearly no single customer will have a data set or knowledge base that is as large. And this is for brute force search, so there isn't an ANN implementation here yet.
So this will become really interesting down the road as it becomes more efficient: how much can you actually do on the edge or on smaller devices? And the optimal use cases really are medium-scale applications that are local-first, or where privacy is an issue, and where sub-second search is still acceptable. But I would expect it's probably really focused on certain domains and on consumer devices, where you have some knowledge that you want to store on device because you don't want to share it, or because it's just faster.
So you could also come up with a strategy where you just cache a set of data on the customer's device and search on top of that. In the meantime, while you show the first few results, you query your own main database, and once you get the results back, you enrich the cached results with the results from your main database. I think this could be a really interesting pattern, because you would get away with a lot less infrastructure: you don't need as much distribution across the world if you have a global user base, and you don't have to worry as much about all the network delays you could encounter. Also, some general performance optimization tips: use binary quantization for really large data sets; if you're doing it in SQLite Vec, use metadata filtering so you can do subset searches to make it faster and more efficient; consider partitioning your data as well, by user or by some form of category; and do performance monitoring, meaning monitor the search latency across different vector sizes and different types of queries.
The complexity, I think, comes into play with the local deployment workflow. When you have that local database, you have to think much more about updates to the database, because you actually have to do the migrations on each of the customer devices. And for that, to be honest, I'd be out of my depth on what exactly has to happen when you, for example, have a schema migration in one of your tables and you have to push it to all the different devices.
You actually can do some really interesting experiments and start way simpler with search. If you start with a SQLite database on the customer's device, or just run it embedded on an endpoint (there are different solutions for that, obviously), it's a very cheap way of experimenting and also putting something in production, because the environment in which you run it, the SQLite database embedded on your user's device and the SQLite database on your own machine when you're developing, is the same environment, which often isn't a given in AI and ML. So you really can take it, experiment locally, and then put it into production, and it should behave mostly the same. And I think that's a really interesting pattern that allows you to start simple and add complexity as needed.
And once your application really hits it off, you can decide which database you should choose based on the query patterns you actually see in production. So you have a more natural progression: you can start simple, you can do a lot of the different types of searches, keyword search, hybrid, some re-ranking, and then you monitor the query patterns, and over time, as the load gets too big for SQLite, you move on and pick Elasticsearch, OpenSearch, Vespa, or Qdrant, depending on your use case.
We will continue next week with an episode on quantization. Alex talked a lot about it; we will be going especially into product quantization and binary quantization next week. So stay tuned for that, and otherwise I will see you next week.