#46 Marek Galovic on Building a Search Database From First Principles | S2 E29


Nicolay Gerold: Usually, I don't do these
types of episodes where I go really deep

into closed source tools, but I think
it's really interesting what the guys at

TopK are building, and I think they're
building something similar to Snowflake.

But for search: where Snowflake basically separates compute and storage for analytics, they are doing it for search.

And they're building on some really
interesting architectural patterns.

These are becoming more and more common nowadays. For example, using a log as the database, where you basically write all changes to an append-only log first; that establishes the exact order of operations, which can be replayed, and gives you data durability.

It comes with a little bit of a performance downside, but it gives you a lot of advantages. There are also things like a custom file format, as we have seen, for example, in LanceDB.

It is also something we are seeing more and more with custom table formats like Iceberg and Delta Lake, where you control how the data is stored, laid out, and organized on disk so you can get more performance out of it, and make it performant enough that you can utilize really cheap object storage in the cloud, like S3.

And then they also build their
custom query execution engine.

And the query engine basically turns
a user request into something that's

more optimizable and understandable.

So basically it parses the query, creates a logical plan, pushes down filters,

distributes the work across nodes, then
the computation happens, and then you

merge the results back in and return them.

And these three components, I think, are really interesting as a more general trend in all of data.

And in this episode we are talking to Marek Galovic, who is a former engineer at Pinecone and is now building a new kind of search database at TopK.

And we will be talking a lot
about how they build it, their

architecture, but also the different decisions that went into building a search database from the perspective of the 21st century, like all the new hardware and architecture components we have access to: really large, scalable storage, really efficient compute, and way faster IO than what we used to have.

And I think that's a really
interesting discussion.

And you can take a lot from it, even though we are talking about something closed source. Let's do it.

Marek Galovic: One super nice thing about building distributed systems and distributed algorithms in Rust is precisely the async ecosystem, where Rust itself doesn't have an async runtime.

You can provide your own runtime.

And there is this project from Tokio called Loom.

And what you can basically do is
write your concurrent program.

It's async, right?

If you do any sort of network communication, all of that you would drive through async. And then you can use a deterministic runtime to simulate all the possible executions of those futures.

And that allows you to basically specify
a formal model for, like, how the system

should behave, assert all the invariants.

And then essentially simulate your distributed algorithm on top of a single node, which makes testing much more robust, because we can actually test all the possible cases.
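To make that concrete, here is a minimal sketch of what a Loom-based deterministic test looks like; the counter and the invariant are my own illustration, not TopK's test suite. The loom crate re-runs the closure once per possible interleaving of the spawned threads, so a violated invariant shows up deterministically instead of as a rare flake.

```rust
// Minimal Loom sketch: exhaustively explore interleavings of a tiny
// concurrent program and assert an invariant over every execution.
#[cfg(test)]
mod tests {
    use loom::sync::atomic::{AtomicUsize, Ordering};
    use loom::sync::Arc;
    use loom::thread;

    #[test]
    fn concurrent_writes_preserve_invariant() {
        loom::model(|| {
            let committed = Arc::new(AtomicUsize::new(0));

            let handles: Vec<_> = (0..2)
                .map(|_| {
                    let committed = committed.clone();
                    thread::spawn(move || {
                        // Each "writer" commits exactly one operation.
                        committed.fetch_add(1, Ordering::SeqCst);
                    })
                })
                .collect();

            for h in handles {
                h.join().unwrap();
            }

            // The invariant is checked across every possible interleaving.
            assert_eq!(committed.load(Ordering::SeqCst), 2);
        });
    }
}
```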

Nicolay Gerold: Why do you think Milvus went mostly with Golang instead of Rust to

build like a cloud native vector database?

Marek Galovic: Before Rust,
Go was like the default choice

for like new systems languages.

Back then, like four years ago when I built my open source vector DB, I built it in Go. It was a pretty good choice for a new systems language: simpler, and with a better ecosystem than C and C++. But Rust has matured a lot over the last five years.

So I would say like Rust is now
becoming like the default choice.

Nicolay Gerold: Yeah, and maybe to kick it
really off, can you maybe quickly define

what is a query execution engine actually?

Marek Galovic: So basically you have a sort of description of what needs to happen; that could be a SQL query. And then you have a bunch of data in some format, and the query execution engine really takes that description of what needs to happen, compiles it into some set of expressions for that data format and for what actually needs to happen at the physical layer, and then drives execution, drives the IO, drives the computation to compute some result and give it back to the client.

Super high level.

Nicolay Gerold: And in your case specifically, how does it work? A specific query is coming in; what's happening behind the scenes to actually go from the query to the final search result?

Marek Galovic: Yeah, so the API we expose is very data-frame-like. We provide an API that's a bunch of stages that describe what needs to happen. And really, what happens is that we built a distributed query engine called Reactor internally.

And then what happens is that the routing node, or the client-facing node, gets the public description of what needs to happen and translates it to a logical plan, which is a sort of internal description of all the operators.

Then there is some optimization of the logical plan, right? So we want to push down predicates to the scans.

We want to push down projections to only read the columns that we care about. That happens still on the router node.

And then, since we have big data sets, we distribute parts of the data set to different executor nodes and send this plan to the executor nodes. On the executor nodes, since the model is more document oriented, different files can have different schemas, because we don't require you to specify a schema upfront. So the logical description gets compiled to a specific plan for each specific file on the executor nodes.

And then we drive the execution, which really dispatches operators onto the Tokio runtime. We compute partial results, right? So every node does its own part of the work. Then we send the results back to the router node, which pipes those results through the same execution engine; it's really Reactor running on both sides. And then it returns the result to the client.
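As a rough illustration of the flow Marek describes (parse into a logical plan, push down predicates and projections, fan out to executors, merge partial results), here is a minimal Rust sketch; the plan shape, field names, and single-predicate handling are my assumptions for brevity, not TopK's internal Reactor types.

```rust
// Sketch of a logical plan with predicate pushdown and file-based fan-out.
#[derive(Clone, Debug)]
enum LogicalPlan {
    Scan { files: Vec<String>, projection: Vec<String>, predicate: Option<String> },
    Filter { predicate: String, input: Box<LogicalPlan> },
    TopK { k: usize, score_expr: String, input: Box<LogicalPlan> },
}

// Push the filter into the scan so executors read and decode less data
// (assumes a single predicate for brevity).
fn push_down(plan: LogicalPlan) -> LogicalPlan {
    match plan {
        LogicalPlan::Filter { predicate, input } => match push_down(*input) {
            LogicalPlan::Scan { files, projection, .. } => {
                LogicalPlan::Scan { files, projection, predicate: Some(predicate) }
            }
            other => LogicalPlan::Filter { predicate, input: Box::new(other) },
        },
        LogicalPlan::TopK { k, score_expr, input } => {
            LogicalPlan::TopK { k, score_expr, input: Box::new(push_down(*input)) }
        }
        scan => scan,
    }
}

// The router splits files across executors; each executor runs the same plan
// over its subset, and the router merges the partial top-k results.
fn distribute(plan: &LogicalPlan, executors: usize) -> Vec<LogicalPlan> {
    match plan {
        LogicalPlan::TopK { k, score_expr, input } => {
            if let LogicalPlan::Scan { files, projection, predicate } = input.as_ref() {
                files
                    .chunks((files.len() / executors).max(1)) // roughly one chunk per executor
                    .map(|chunk| LogicalPlan::TopK {
                        k: *k,
                        score_expr: score_expr.clone(),
                        input: Box::new(LogicalPlan::Scan {
                            files: chunk.to_vec(),
                            projection: projection.clone(),
                            predicate: predicate.clone(),
                        }),
                    })
                    .collect()
            } else {
                vec![plan.clone()]
            }
        }
        _ => vec![plan.clone()],
    }
}
```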

Nicolay Gerold: And within that, what are the morsels, so the small pieces of work that you're dispatching? Can you give an example?

Marek Galovic: Yeah.

So for example, if you are doing a vector search, one of those morsels is: you have a block of data, which is a part of a file, and then you want to compute something like cosine similarity between the vectors, or projections, in that block of data and your query. And so that morsel is really: give me the cosine similarity on 10K rows.
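A minimal sketch of such a morsel, assuming plain f32 vectors rather than TopK's actual compressed representations: score one block of rows against the query with cosine similarity.

```rust
// One morsel of work: cosine similarity of every row in a block vs. the query.
fn cosine_scores(block: &[Vec<f32>], query: &[f32]) -> Vec<f32> {
    let q_norm = query.iter().map(|x| x * x).sum::<f32>().sqrt();
    block
        .iter()
        .map(|row| {
            let dot: f32 = row.iter().zip(query).map(|(a, b)| a * b).sum();
            let r_norm = row.iter().map(|x| x * x).sum::<f32>().sqrt();
            if r_norm == 0.0 || q_norm == 0.0 { 0.0 } else { dot / (r_norm * q_norm) }
        })
        .collect()
}
```

Many such morsels, one per block of roughly 10K rows, are what get dispatched onto the runtime and later merged into the node's partial result.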

Nicolay Gerold: Yeah, and you mentioned
in our previous conversation that you

have created your own custom file format.

What makes it better suited for search use cases than, for example, Arrow, Parquet, and the other formats we currently have?

Marek Galovic: Yeah, so we built the database to be cloud native and to really use object storage as the primary storage medium. And object storage has very different characteristics compared to locally attached disks, whether spinning disks or SSDs. Your request latencies are like 100 to 200 milliseconds P99.

And then you also pay per request, right? So you don't want to do too many small requests, because that's super expensive. So you want to take that into account, take the latency into account, and then find optimal request sizes to maximize throughput and minimize cost.

And really, the issue with an off-the-shelf format, take Parquet as the default, is that Parquet couples the IO size and IO execution with the statistics, right? In Parquet you have row groups, within row groups you have column chunks, within those you have pages, and the page really is the unit of IO.

The issue with using that for search is that, basically, if you have a bunch of columns and one column is a float32 or u32, some small value, and another column is, let's say, a 768-dimensional vector, then reading the whole scalar column is going to be like 40 kilobytes for 10K rows, versus a bunch of megabytes for the vector column. And since you're issuing IO at that granularity, you will do a very small request for the scalar field and then a huge IO for the vector field.
And that's an issue.

You want to have very small pages, because you want statistics that are very granular to do the pruning very efficiently. But at the same time, for the case of the u32 field, you want to maximize the request size, because you want to avoid doing very small requests. And coupling the statistics granularity to the IO granularity is an issue for doing that on top of blob storage.

The other issue is that Parquet was really built for spinning disks. So if you read a Parquet file, you have a lot of serial dependencies: I need to read the footer, the footer gives me a pointer to the schema, the schema gives me a bunch of block indexes, and then I can read the actual data. Those are serial dependencies, where each of those is 70 milliseconds to actually get the data.

For object storage, you want to do a lot of concurrent IO, and we built the data format to really have wide IO trees, where we can read some small amount of data and then issue a huge amount of concurrent IO to actually pull the data very quickly.

So that's the two biggest differences.
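A minimal sketch of that wide-IO read pattern, under an assumed ObjectStore interface (not a specific SDK; the futures and async-trait crates are assumed dependencies): one small metadata read, then all data ranges fetched concurrently rather than as a chain of serial round trips.

```rust
// Fan-out reads against object storage instead of footer -> schema -> data chains.
use futures::future::try_join_all;
use std::ops::Range;

#[async_trait::async_trait]
trait ObjectStore: Sync {
    async fn get_range(&self, key: &str, range: Range<u64>) -> std::io::Result<Vec<u8>>;
}

async fn read_wide(
    store: &dyn ObjectStore,
    key: &str,
    data_ranges: Vec<Range<u64>>, // discovered from the small root metadata read
) -> std::io::Result<Vec<Vec<u8>>> {
    // One round trip for the root of the IO tree...
    let _root = store.get_range(key, 0..64 * 1024).await?;
    // ...then all leaf ranges are fetched concurrently, so total latency is
    // roughly one object-storage round trip, not one per dependency level.
    try_join_all(data_ranges.into_iter().map(|r| store.get_range(key, r))).await
}
```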

You also mentioned building a
custom file format for search.

We actually use Arrow as our data
layout, the primary data layout, and

also the in memory layout for execution.

Although we extended Arrow with custom
layouts to basically support types like

dense matrices, sparse matrices, and
bit packed posting lists to optimize for

the use cases that we really care about.

If you co-design the layout with the execution, you can basically do stuff like executing the kernels on top of compressed data. So you don't really need to decompress, for example, posting lists; you can do the actual compute on the compressed representation.

Nicolay Gerold: Yeah, and in search,
you usually pre compute a lot of stuff.

So you pre compute all the indexes, all
the statistics of the different documents.

What things are you pre-computing, and where are you storing them? Are you storing them with the row groups, with the files, with the column groups?

Marek Galovic: Yeah.

So we do pre compute like more efficient
representations of vectors, for

example, to make the scans much faster.

We pre compute inverted indexes for
like keywords and keyword search.

We also do things that are not really precomputing, but for example, if you store float32 or string values, we do dictionary compression to actually make that much smaller and make the predicate evaluation much faster. So this is the sort of stuff we tend to precompute.

In terms of where we store it, it depends.

Having control over the file structure allows us to basically store the columnar data as columns, logically, and then attach external indexes. For example, the inverted index for keywords is attached within the same file, but it's separate from the column. The query execution then allows us to take the columnar data, take the external indexes, and basically intersect them and execute the query on both in the same pass.

Nicolay Gerold: Nice.

That's really interesting.

And how do you actually
handle distributed queries?

So you have multiple files, stored probably in different locations. How do you distribute the query and make sure that you actually have the correct statistics across all the different files, and also make sure that you can find all of the different search results?

Marek Galovic: Yeah, so maybe to answer this question, it would be helpful to go over the high-level architecture of how the system looks.

We follow the approach of "the database is the log". So when we get a write, we append it to a write-ahead log, and at that point, once we acknowledge it, you can be sure the data is durably persisted.

And then, on top of that write-ahead log, there is an indexing process that tails the log and builds a more read-optimized representation of the data. Those are the index files. And the description of what all the index files are is a manifest, where we store pointers to every file in the index.

The manifest really is a consistent snapshot of the system at a point in time. So if an indexing process fails, we can just load the most recent manifest and pick up where we left off.

So that's the guarantee of consistency: we always have a consistent snapshot of the index.

And then, in terms of how we execute distributed queries: the router node loads this manifest, which is the description, figures out what all the files are, maybe does some high-level pruning to exclude the data that's not relevant for the query, and then takes the files that are relevant and distributes them over a set of executor nodes. And then the Reactor framework that I mentioned takes over and runs the execution on top of the executor nodes.
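Here is a minimal sketch, with assumed types, of that write-ahead log plus manifest flow; TopK's real structures will differ, but the shape is: the append establishes order and durability, the indexer tails the log, and each published manifest is a consistent snapshot to restart from.

```rust
// Sketch of "the database is the log": WAL append, indexer tailing, manifest snapshot.
#[derive(Clone)]
struct LogEntry { seq: u64, document: Vec<u8> }

#[derive(Clone, Default)]
struct Manifest {
    last_indexed_seq: u64,    // where the indexer left off in the WAL
    index_files: Vec<String>, // pointers to every read-optimized file
}

struct Wal { entries: Vec<LogEntry>, next_seq: u64 }

impl Wal {
    // Appending assigns a sequence number (total order); the ack implies durability.
    fn append(&mut self, document: Vec<u8>) -> u64 {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.entries.push(LogEntry { seq, document });
        seq
    }
}

// The indexer consumes everything after the last committed manifest and
// publishes a new manifest atomically; a crash simply restarts from the
// latest manifest and replays the remaining log entries.
fn index_step(wal: &Wal, current: &Manifest) -> Manifest {
    let new_entries: Vec<&LogEntry> = wal
        .entries
        .iter()
        .filter(|e| e.seq > current.last_indexed_seq)
        .collect();
    let mut next = current.clone();
    if let Some(last) = new_entries.last() {
        next.last_indexed_seq = last.seq;
        next.index_files.push(format!("index-{}.bin", last.seq));
    }
    next
}
```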

Nicolay Gerold: Yep.

And do you actually redistribute the data?

So for example, if certain pieces of information are often retrieved together, do you actually redistribute the data, so that similar data that's retrieved together ends up co-occurring in the same files and same locations?

Marek Galovic: Not really. We tried to minimize the assumptions on access patterns, which is an issue with the vector databases, right? Because they assume the locality is there in the vector space, but your queries are usually all over the vector space, not necessarily coalesced together. So we removed that assumption.

In terms of how data is organized, the storage engine really is an LSM tree, so the data is sorted by the ID.

And so one access pattern that's very common is that people want to store multi-tenant data in the same index. The way to do that very efficiently with our system is to basically prefix the ID with the tenant ID. Then, once we sort by the user-provided ID, all of your tenant data will be coalesced together, right?

So you can issue a query that says: hey, give me all the data I want, subject to the ID starting with this prefix. And then we can prune all the files that are not relevant and basically just read the files that contain the data you really care about.

So in that sense it is coalesced, but we don't regroup the data based on the query patterns.

Yeah.
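A small sketch of that tenant-prefix pruning idea, with assumed file metadata: because data is sorted by the user-provided ID, per-file min/max key ranges are enough to skip files that cannot contain a given tenant prefix.

```rust
// Prune index files whose key range cannot overlap the tenant's ID prefix.
struct IndexFile { name: String, min_id: String, max_id: String }

fn files_for_prefix<'a>(files: &'a [IndexFile], prefix: &str) -> Vec<&'a IndexFile> {
    let upper = prefix_upper_bound(prefix);
    files
        .iter()
        .filter(|f| f.max_id.as_str() >= prefix && f.min_id < upper)
        .collect()
}

// Smallest string greater than every string starting with `prefix`
// (illustrative; a real engine would compare raw key bytes and handle
// the all-0xFF edge case).
fn prefix_upper_bound(prefix: &str) -> String {
    let mut bytes = prefix.as_bytes().to_vec();
    while let Some(last) = bytes.pop() {
        if last < 0xFF {
            bytes.push(last + 1);
            break;
        }
    }
    String::from_utf8_lossy(&bytes).into_owned()
}
```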

Nicolay Gerold: You also mentioned
you're using a cache on EC2.

I'm really curious what's
the caching strategy?

Like, how is it determined?

What stays hot?

What's evicted?

And how do you actually handle
also the cache consistency?

Marek Galovic: So specifically what we cache, and where, is part of the secret sauce that gives us good latencies at a low cost. But on a high level, in terms of eviction strategy, it is just LRU. That works very well, because you're basically caching blocks, and you want to prioritize the data that's most recently accessed.

We have the option to pin customers in the cache and provide dedicated capacity, but for the shared cluster it's LRU. How the caching works specifically: it's two tiers, where every executor node caches both in memory and on disk, and then the blob storage is the durable layer.

The granularity of what we cache: the data format has a bunch of buffers, and on disk we cache buffers, which could be multiple columns or multiple unrelated pieces of data; at write time we try to coalesce relevant data together into the buffers. And then in memory we cache at a smaller granularity, which helps us prune a bunch of data very quickly, execute the first layer, and minimize the amount of data we have to read from disk. That's, on a high level, how it works.
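To illustrate the two-tier idea, here is a minimal LRU-over-LRU sketch with assumed types; it is not TopK's cache, but it shows the lookup order: memory first, then local disk, and only on a double miss do you pay the object-storage round trip and backfill both tiers.

```rust
// Two-tier LRU cache sketch: hot data in memory, warm data on local disk,
// blob storage as the durable source of truth.
use std::collections::{HashMap, VecDeque};

struct LruTier { capacity: usize, order: VecDeque<String>, data: HashMap<String, Vec<u8>> }

impl LruTier {
    fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        if let Some(v) = self.data.get(key).cloned() {
            // Move to the back: most recently used.
            self.order.retain(|k| k.as_str() != key);
            self.order.push_back(key.to_string());
            Some(v)
        } else {
            None
        }
    }

    fn put(&mut self, key: String, value: Vec<u8>) {
        if self.data.len() >= self.capacity {
            if let Some(evicted) = self.order.pop_front() {
                self.data.remove(&evicted); // evict least recently used
            }
        }
        self.order.push_back(key.clone());
        self.data.insert(key, value);
    }
}

struct TwoTierCache { memory: LruTier, disk: LruTier }

impl TwoTierCache {
    fn get(&mut self, key: &str, fetch_from_blob: impl Fn(&str) -> Vec<u8>) -> Vec<u8> {
        if let Some(v) = self.memory.get(key) {
            return v; // hot path: in-memory hit
        }
        if let Some(v) = self.disk.get(key) {
            self.memory.put(key.to_string(), v.clone()); // promote to memory
            return v;
        }
        // Miss in both tiers: pay the object-storage round trip once, then backfill.
        let v = fetch_from_blob(key);
        self.disk.put(key.to_string(), v.clone());
        self.memory.put(key.to_string(), v.clone());
        v
    }
}
```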

Nicolay Gerold: Yeah, and I think most vector databases actually run two retrievals: on the new data, which isn't indexed yet, basically a brute-force vector search is run, and on the older data you run your HNSW index or whatever.

How do you approach the same problem of eventual consistency? In the end, you can only run the retrieval when the data is actually indexed. And how does it work for the different types of search you support? Because you have BM25, probably classical term search, but you also have the vector search.

Marek Galovic: So in terms of consistency models, we support three consistency models, which is really two, plus one in between.
We do support strong consistency.

So if you write the data, any client in the system can actually read that data if it asks for a strongly consistent read. It's more expensive, because we have to check what the latest data really is against the source of truth, but we can do that.

In the default mode, we propagate the write within a second, right? So if you write your data, that write will be visible; the max delay there is like 600 milliseconds, so it will be pretty much there right away. That mode then also needs to query the data that's not really read-optimized yet.

And then for high-performance and large-scale use cases, people don't necessarily care about data latency that much; they usually just have some background job that writes the data into the index. So if you want to get the best performance, you can use indexed mode, which skips the most recent data and only reads the read-optimized representation. That's what gives you the best performance, but the delay there is like a couple of minutes.
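A compact way to picture the three modes, as an assumed sketch of a read path rather than TopK's actual API:

```rust
// Sketch of how a read mode could translate into what gets scanned.
enum Consistency { Indexed, Default, Strong }

struct ReadPlan { read_index_files: bool, read_wal_tail: bool, verify_against_log: bool }

fn plan_read(mode: Consistency) -> ReadPlan {
    match mode {
        // Best latency and cost; freshness lags by minutes until indexing catches up.
        Consistency::Indexed => ReadPlan { read_index_files: true, read_wal_tail: false, verify_against_log: false },
        // Writes visible within ~a second; queries also scan the not-yet-optimized tail.
        Consistency::Default => ReadPlan { read_index_files: true, read_wal_tail: true, verify_against_log: false },
        // Most expensive: also confirm the latest committed writes against the source of truth.
        Consistency::Strong => ReadPlan { read_index_files: true, read_wal_tail: true, verify_against_log: true },
    }
}
```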

Nicolay Gerold: Yeah.

What are the different use cases you have seen for the different consistency modes? Like, when is strong consistency actually necessary?

Marek Galovic: Yeah for large scale
use cases that are not interactive,

the indexed mode is the best choice,
because you get the best performance, and

you can load the data in an async job.

That's not user interactive,
so it's not a problem there.

The use cases for the default mode that we provide, and for strong consistency, are really interactive applications. Imagine you're in a Jupyter notebook developing your search. You really don't want to wait 10 minutes after you write your data before you can test your query. The default mode is really best suited there: you write a bunch of data, and then when you go run the query, it's there.

And I think the default and strong consistency modes are becoming super useful for agentic applications that are interactive with the user. You want to provide memory for your agent or whatever, and you don't want the agent to not remember for 10 minutes before the memory shows up.

So if you are interacting with the agent, you want results right away. Doing this with either the default mode, if you are fine with the small delay, or with a strongly consistent read, is super important, because you're guaranteed to get the most recent write reflected in the query.

Nicolay Gerold: Yeah, I think one often underappreciated aspect of Elastic and the other search databases is the analytics. Do you already have a custom data format implemented for the feedback data and the analytical data as well?

Marek Galovic: We don't have a custom format for any feedback yet. The data format that we've built is suited for analytics, though. Also, the query engine right now is optimized for search, but it's basically a matter of adding more operators to support stuff like aggregations, to support metrics, to support stuff like that.

Because really, in the end, it's an execution engine on top of columnar data, right? We optimize for something, but basically you can add operators to support other types of queries.

One side use case of building a very good filtering engine and general query engine is that log search, and basically security and observability, is a special case of search. So we can also support use cases like that very efficiently.

Nicolay Gerold: And can you actually fall back on DataFusion a little bit more for the analytical queries?

Marek Galovic: Not really; the execution engine is a custom engine. We looked at DataFusion early on, and basically what we realized is that DataFusion is very good for analytical queries. That's what it's been built for, and a lot of new databases actually use DataFusion as their execution engine.

The reason why it doesn't really work for search is that DataFusion doesn't support external indexes natively. The way you implement indexes in DataFusion is that you basically build one side of a join and then you join your data with the index, right? And building that one side of the join is pretty expensive. If you have control over the data layout and execution, you can instead do a sorted intersection of both, and then execute the index scan and the data scan much more efficiently. That's the primary reason why DataFusion doesn't work.

The other reason is what I found about filtering in DataFusion: it's really rebuilding the batches, right? So if you filter, you basically copy the data into new batches and then continue execution on top of that. Versus what we can do is essentially execute filters as logical operations: if we know that the data is already cached in memory, there is no point in copying the actual data. You can simply create a new selection mask, or bitmask, on top of that data and then continue execution using that.

On a high level, you basically get three to four times better performance by not using DataFusion.
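A minimal sketch of the difference, purely illustrative: copying survivors into a new batch versus evaluating the predicate into a packed selection bitmask that downstream operators consult.

```rust
// Two filtering strategies over a cached column.

// Materializes a new, smaller batch: every surviving value is copied.
fn filter_by_copy(column: &[f32], predicate: impl Fn(f32) -> bool) -> Vec<f32> {
    column.iter().copied().filter(|v| predicate(*v)).collect()
}

// One bit per row, packed into u64 words; no row data is copied.
fn filter_to_bitmask(column: &[f32], predicate: impl Fn(f32) -> bool) -> Vec<u64> {
    let mut mask = vec![0u64; (column.len() + 63) / 64];
    for (i, v) in column.iter().enumerate() {
        if predicate(*v) {
            mask[i / 64] |= 1u64 << (i % 64);
        }
    }
    mask
}

// Downstream operators simply check the bit before touching a row.
fn is_selected(mask: &[u64], row: usize) -> bool {
    mask[row / 64] & (1u64 << (row % 64)) != 0
}
```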

Nicolay Gerold: Yeah, and you mentioned also in our previous conversation that you have built some custom Arrow extensions for dense and sparse matrices. What do they actually enable you to do?

Marek Galovic: So one huge unlock there is that in vector databases, mostly, vectors cannot be null: you have your index, and you have to provide the primary key for the index. Since our model is more document oriented, you can have a bunch of vector fields that are just treated as a data type. And those can be null; we don't require you to always provide a value for a field.

And so one big unlock of having a custom data layout is that we can represent null vectors very efficiently, without actually storing dummy values, and still be able to do point reads for rows that may or may not be null. The compression is in the way you represent that, the way you figure out where the value actually is in terms of byte offset.

Arrow stores an offset list, which is u32. The way we figured out how to do that is basically 20 to 25 times smaller. For a file of 10 million rows, that really means we are not reading 40 megabytes of dummy values; you can basically read 100 kilobytes and do the same. There is no overhead to that.

Similarly for sparse matrices. I also mentioned bit-packed arrays: Arrow doesn't natively support bit-packed arrays, and it also doesn't support sparse matrices, which you want in order to optimize the sparse matrix-vector product. So we built custom types to compute those types of scores very efficiently.
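As one illustration of how nullable fixed-width vectors can be located without a per-row u32 offset list (an assumption about the general technique, not TopK's actual layout), a validity bitmap plus a popcount rank gives the position of each non-null vector while storing roughly 32x less offset metadata:

```rust
// Nullable fixed-width vectors: a validity bitmap plus densely packed values.
struct NullableVectors {
    dim: usize,
    validity: Vec<u64>, // 1 bit per row
    values: Vec<f32>,   // only non-null vectors, packed back to back
}

impl NullableVectors {
    fn get(&self, row: usize) -> Option<&[f32]> {
        let word = row / 64;
        let bit = row % 64;
        if self.validity[word] & (1u64 << bit) == 0 {
            return None; // null row: no dummy vector is stored
        }
        // Rank: how many non-null rows precede this one.
        let mut rank: usize = self.validity[..word].iter().map(|w| w.count_ones() as usize).sum();
        rank += (self.validity[word] & ((1u64 << bit) - 1)).count_ones() as usize;
        let start = rank * self.dim;
        Some(&self.values[start..start + self.dim])
    }
}
```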

Nicolay Gerold: And what other advantages do you actually get from the Arrow format? Does it help you with the distributed system as well?

Marek Galovic: It doesn't necessarily help with the distributed system. Even if you look at DataFusion, a lot of the value of DataFusion is that it basically uses Arrow and the Arrow compute kernels to do the undifferentiated stuff, right? Filtering floats: every database needs to filter floats, every database needs to filter strings. That's not a differentiator you need to care about and spend engineering time writing kernels for.

So we get a lot of that by simply using Arrow and skipping the DataFusion part. And that's actually what we get: for the custom types, we write custom kernels and optimize them for different architectures; that's the meat of the database. Whereas filtering floats, filtering strings, doing undifferentiated ops, that we can just offload to Arrow and use their compute kernels. And that's a huge time saving.

Nicolay Gerold: The manifest files, maybe to double-click on that: I have mostly worked with Delta and Iceberg, which basically have JSON files. Partly why they do this is because on S3 there is a limit on how many files you can list, so you usually have manifest files so you can list all of them at once. How do your manifest files actually differ from, for example, Iceberg and Delta?

Marek Galovic: Yeah.

On a high level it's very similar, because if you're building a database on object storage, you tend to make similar choices in some regards. The difference is that, as I mentioned, we are document oriented, not necessarily table oriented, so we don't really enforce a schema or store a schema in the manifest files. And we have just one manifest file for the whole index.

The huge unlock there is that Iceberg, as you mentioned, uses JSON files. What we've built basically uses FlatBuffers as our format for describing what exists, and what that allows us to do is to just read the data and interpret it, right? So we don't have to decode JSON, we don't have to parse it in a special way. We simply load the bytes and then use FlatBuffers to know what's in the index.

That's a huge saving, especially for large indexes, because if you do JSON, you can end up spending like 600 milliseconds just decoding the manifest. That's pretty much all of your query budget. So by not spending those 600 milliseconds, only paying the IO once and then just using the data, we can basically minimize the worst-case latency.
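To illustrate the decode-versus-interpret difference, here is a toy byte layout viewed in place; it is not TopK's schema and not the FlatBuffers API itself (FlatBuffers resolves fields through stored offset tables rather than a linear walk), but it shows why "loading" such a manifest is essentially just the IO.

```rust
// Zero-copy view over a flat binary manifest instead of decoding JSON into
// owned structures. Assumed toy layout: [u32 file_count][u32 len | utf8 name]*
struct ManifestView<'a> { bytes: &'a [u8] }

impl<'a> ManifestView<'a> {
    fn file_count(&self) -> u32 {
        u32::from_le_bytes(self.bytes[0..4].try_into().unwrap())
    }

    // Walks entry headers; a real format would keep an offset table for O(1) access.
    fn file_name(&self, index: u32) -> &'a str {
        let mut offset = 4usize;
        for _ in 0..index {
            let len = u32::from_le_bytes(self.bytes[offset..offset + 4].try_into().unwrap()) as usize;
            offset += 4 + len;
        }
        let len = u32::from_le_bytes(self.bytes[offset..offset + 4].try_into().unwrap()) as usize;
        std::str::from_utf8(&self.bytes[offset + 4..offset + 4 + len]).unwrap()
    }
}
```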

Nicolay Gerold: Can you also enforce any ACID properties with your execution engine and with the file schema?

Marek Galovic: Yeah, so as I mentioned, the writes are atomic once we commit them to the write-ahead log. And the way we build the write-ahead log is that it basically provides sequencing. So at the time we commit the write to the log, we know a version for that write, and basically the write with the most recent version is the one that wins. The compaction and indexing part then consumes the log and uses those versions to enforce what the newest version is.

So at the time we commit a new version of the manifest, or publish a new version of the index, that's a consistent snapshot of the database. That moves the state from one version to a new version, and there is no way to see partial results in your queries.
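A minimal sketch, with assumed types, of the last-write-wins rule this implies: the log assigns each write a sequence number, and compaction keeps only the highest-sequence entry per document ID.

```rust
// Last-write-wins compaction keyed on the WAL sequence number.
use std::collections::HashMap;

struct VersionedWrite { doc_id: String, seq: u64, payload: Vec<u8> }

fn compact(log: Vec<VersionedWrite>) -> HashMap<String, VersionedWrite> {
    let mut latest: HashMap<String, VersionedWrite> = HashMap::new();
    for write in log {
        // Keep this write only if it is newer than what we have for the same ID.
        let keep = latest
            .get(&write.doc_id)
            .map_or(true, |existing| write.seq > existing.seq);
        if keep {
            latest.insert(write.doc_id.clone(), write);
        }
    }
    latest
}
```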

Nicolay Gerold: Yeah,

is the,

no, I lost my question.

Does it also give you the capability
to do basically a history?

So I can go versions back in
time and run my queries on all

versions of the search database.

Marek Galovic: In principle, we could do that, because you can imagine every manifest being a snapshot, right? We don't expose it right now, but we could expose all the snapshots that we created, maybe keep the last day of history in the manifest, and then you could choose: okay, I want this snapshot of the database. And then, given it's a consistent snapshot, you could just run a query against that snapshot and you would get that version of the data.

Nicolay Gerold: That's really cool.

And what I'm really curious about is, in terms of kernels: I think Rust doesn't have all the decades of optimizations that went into C and C++. Have you done any benchmarks of the performance impact this has? Because in AI this is all the talk: why aren't we using Rust, when we have decades of optimizations in C++? Have you tried to quantify that?

Marek Galovic: Yes.

So in terms of kernels, we handwrite SIMD kernels for the operators that we care about, right? For example, how do we intersect a bit-packed list for text search? How do we very efficiently compute distances for vectors? For all of that we've written custom kernels for both x86 and ARM. That allows us to basically run on either in the cloud: ARM nodes are much cheaper and usually give you better performance. So you can pick the architecture you want to run on, and we have kernels compiled for that specific architecture.

Nicolay Gerold: Yeah.

And do you think Rust is faster, or C is faster?

Marek Galovic: It depends on the use case. All of them dispatch to LLVM, so ultimately it depends on the compiler. You can get the same performance as C or C++ by basically doing what we did: you wrap unsafe code in a safe interface, and then in the unsafe block you do whatever you need. So what we did is write intrinsics, which are basically the instructions for specific architectures, and that's about as good as you can do.
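The pattern he describes looks roughly like the following sketch (an illustration, not TopK's kernels): a safe function does runtime feature detection and dispatches to an unsafe, target-feature-gated intrinsics kernel, with a scalar fallback; a production kernel would also handle FMA, alignment, and ARM NEON variants.

```rust
// Safe wrapper around a handwritten AVX2 dot-product kernel with scalar fallback.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // The unsafe intrinsics kernel is only reached behind the feature check.
            return unsafe { dot_avx2(a, b) };
        }
    }
    dot_scalar(a, b)
}

fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let mut acc = _mm256_setzero_ps();
    let chunks = a.len() / 8;
    for i in 0..chunks {
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    // Horizontal sum of the 8 accumulated lanes.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    // Scalar tail for lengths not divisible by 8.
    for i in (chunks * 8)..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
```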

Nicolay Gerold: And in terms of scale, what is the largest data set you have tested TopK with? Like, how far did you try to push it?

Marek Galovic: So far, for a single namespace, a single data set, we pushed it to a hundred million, and we've been running that for three weeks at production scale. The latencies there, at the 100 million scale, are now around 200 milliseconds P99. For smaller indexes, which is most of the tenants in the system, that would be like 1 to 10 million, and there the latencies are around 50 milliseconds P99.

Nicolay Gerold: Yeah.

And I think Milvus is at the moment
doing a big push into doing GPU

acceleration for vector search as
well, but also for the indexing.

Is this something you're already
exploring or that's like on

the horizon for the future?

Marek Galovic: I mean, we looked at that. It really depends on the use case you're optimizing for. GPUs are quite expensive, and if the operating point you want to hit is on the order of hundreds of QPS with a reasonable latency, let's say 50 to 100 milliseconds, you don't necessarily need GPUs for that. You can get very good performance there for a pretty low cost.

Same goes for indexing.

It depends on the technology, right? And the sort of algorithms you use. But so far we did a bunch of cost estimations and we haven't seen the value for GPUs.

I guess it depends on the use case, right? Because if you have something like a recommendation system that needs to do ultra-high QPS and needs to do that at one-millisecond latency, then doing HNSW stuff, graph-based algorithms, or GPU-accelerated compute makes sense there. But the truth is, those types of use cases are like 1 percent of all the use cases.

Nicolay Gerold: Yeah.

What is the use case you've seen, or what would you say is the perfect use case for the search database you're building?

Marek Galovic: I would say, right now the use cases we target are really the ones where getting super relevant context or results is critical, and that would be financial, healthcare, and legal use cases. The primary reason is that there you want to be able to do different types of retrieval in the same query, to basically say: I want to get semantically similar results that also satisfy these text filters and metadata predicates, and then apply custom scoring on that, using different vector scores and maybe a BM25 score, in the same query.

And really, most of the existing search systems do it by having separate indexes, maybe for different vector fields, and then a separate index for text; you get partial results, then you do something like reciprocal rank fusion on top of the partial results, and then return that to the client. I think that's the way to do hybrid search with the current technologies.
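For reference, the reciprocal rank fusion step those systems run over the partial result lists is the standard formula below (a generic sketch, not TopK's approach): each document's fused score sums 1/(k + rank) across the lists it appears in, with k commonly set around 60.

```rust
// Standard reciprocal rank fusion over several ranked candidate lists.
use std::collections::HashMap;

fn reciprocal_rank_fusion(ranked_lists: &[Vec<String>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in ranked_lists {
        for (rank, doc_id) in list.iter().enumerate() {
            // Ranks are 1-based in the usual formulation.
            *scores.entry(doc_id.clone()).or_insert(0.0) += 1.0 / (k + (rank as f64 + 1.0));
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}
```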

The engine and the system that we built allow you to do all of that in the same query. So you can basically do multi-vector search with BM25 scoring, and then apply a custom expression to actually compute the top-k input and score based on that.

And really, what that allows us to do: for example, one use case we've seen is a medical search. What they do is retrieve passages from journal articles, overfetch, and feed that to a re-ranker. The reason they overfetch is that they want to boost some articles by a score for how good the journal is, and there is really no way to do that in the vector database they're using. So they overfetch results, re-sort based on this importance field, and then feed that to the re-ranker.

That's not necessarily the best you can do, right? Because you can imagine some articles being super important but not really having a high base relevance score in the vector system, so they wouldn't make the first top-k, even if you overfetch. With us, what they can do is basically move that custom scoring function into the database itself, so the candidate set they get is the best one based on their custom scoring function, and then feed that to the re-ranker and ultimately to the user. That's the capability unlock that gives them better relevance out of the box.

Nicolay Gerold: Yeah.

Not just better relevance. Often, at the moment with vector databases, if you have to do post-filtering, a lot of the times I've seen that the use cases actually require you to extract information after the search. And when you have to deliver a fixed set of search results, it's a real pain, because often you end up having to run another retrieval in which you have to exclude all the documents you already retrieved.

Marek Galovic: Yeah, most of those discussions are from two or three years ago, when there was a huge push of "oh, we do pre-filtering, we do post-filtering" in vector DBs. To be fair, most vector DBs have actually figured out how to do some form of pre-filtering in the database, so they can filter the results, and it won't happen that you get empty results even though there are results in the database, because they pre-filter before the top-k operator, or before applying the limit.

The issue there is that usually the distribution of vectors and the distribution of filter values are uncorrelated. So what happens is that you apply your filter and then you have to scan a large part of the index anyway, because the filter selects points essentially at random in the vector space.

And that's a huge waste, if you think about it, because you spend all this effort indexing the vectors, all this effort building an IVF-style index or a graph-based index, and then at query time you basically end up scanning most of the data anyway, which is slow and wasteful.

I think that's like a fundamental
issue with building a database around

like a vector as a primary key.

Or like around a vector
index as a primary concept.

Nicolay Gerold: And what are the knobs, or levers, you actually expose to the user that allow them to fine-tune or optimize the search for their specific use case? On top of the standard stuff, like I'm running a term search and I'm optimizing it, I'm boosting certain terms: what are you exposing to the user to optimize the search performance, but also the quality of the search results?

Marek Galovic: Yeah, so actually, maybe a different way of thinking about it: we don't really expose knobs. The query language we expose allows you to write any arithmetic expression you want for scoring, right? So you don't have "hey, boost this specific term by five". It's really that you can write an expression that gives you a numeric output, and then score by that. So you have flexibility almost like a programming language: you write some expression that gives you some number, and then you can score by that. Which is very different compared to "yes, I have some keyword filter, and then I specify just a fixed set of boosting parameters", like would be the case in Elastic.
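To make that concrete: scoring by an arbitrary arithmetic expression amounts to evaluating a small expression tree over per-document fields. This is my own illustration of the idea, not TopK's query language; something like 0.7 * vector_score + 0.3 * bm25_score + journal_weight would be such an expression.

```rust
// Evaluate a tiny arithmetic scoring expression over a document's numeric fields.
use std::collections::HashMap;

enum Expr {
    Field(String), // e.g. "vector_score", "bm25_score", "journal_weight"
    Const(f64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

fn eval(expr: &Expr, doc: &HashMap<String, f64>) -> f64 {
    match expr {
        Expr::Field(name) => *doc.get(name).unwrap_or(&0.0), // missing field scores as 0
        Expr::Const(c) => *c,
        Expr::Add(a, b) => eval(a, doc) + eval(b, doc),
        Expr::Mul(a, b) => eval(a, doc) * eval(b, doc),
    }
}
```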

And then, as for what we expose explicitly: you can, of course, provide different vector types, so you can do float32, you can do scalar quantization, you can do binary quantization, and you'll be able to do sparse vectors very soon. Those are the vector scoring functions. You can compute BM25 scores, with the parameters for BM25, and you can tune that.

But what we really want is that, off the shelf, it should be super simple and super intuitive to build production-scale search. And then you can play with scoring, you can play with other things, to push it further.

Nicolay Gerold: Do you want to integrate learning to rank into this as well, so that you can actually come up with that expression in a more automated way?

Marek Galovic: Yes. Ultimately, if you look at how production-level search systems work at large companies, they usually have some document transformation; today that would be embeddings, but you can imagine more ways of transforming documents. You have a query understanding module, which takes the query and translates it to some domain-specific DSL; that would be the query language. And then on the output you have ranking, and really the job of ranking is to take a set of candidates and re-sort them by some measure of relevance that's very domain specific.

Today you can get re-rankers from Cohere and other companies, you can get open source re-rankers, but those are really trained on public data. So that doesn't mean it's the best relevance you can get for your domain or your specific application.

Ultimately, in the end, what we want to do is provide all of these pieces in the same API, with the ability for applications to provide feedback: this item was clicked, this item was added to the cart. That's a signal you can use to optimize the ranking further, and you can basically create application-specific rankers inside the database itself. That's the place we want to get to in the end; that's the ultimate goal.

Nicolay Gerold: And I think that's really interesting. I think the learning-to-rank part is something that, at really large search organizations, is done a lot, but it's not really talked about. And I think it's a really easy addition to the search stack, because usually it's some form of boosted machine learning model, which is really easy to deploy and optimize as well.

Marek Galovic: Yeah, it's a spectrum. For simple models, there are linear models that combine, I don't know, similarity scores with BM25 and some precomputed features. That's a sort of linear regression function, and you can express that already in the query language: it's an arithmetic expression you can just write, and it will work.

For more complicated rankers that are learned and use, I don't know, LLMs, you can still do that. The issue there is that if you want to provide a good out-of-the-box experience, you need to be able to improve the model from a small amount of feedback. And that's a core research challenge: how do you do that when you don't have much data, but you want a strict improvement over the baseline ranking model you provide out of the box? Once we can solve that very reliably and be sure we can provide a good experience, we will put that in as part of the core language.

Nicolay Gerold: Do you already have any research directions you want to explore for doing that?

Marek Galovic: Yeah. The first thing really is: how do you improve off-the-shelf ranking models? You want a good baseline experience, so even if you don't provide feedback, the re-ranking part of the core language should give you better results. But then, how do you take that and make it strictly better, a Pareto improvement, given a small amount of feedback? So that's on the ranking side.

On the document transformation part, I think there's really interesting research and improvements you can do on training hybrid dense and sparse embedding models. Given that we can do both types of retrieval in the same query, we can actually learn the optimal combination of the two to maximize the relevance of the candidates that we get. So that's the second part of research we want to do, on the input side.

Nicolay Gerold: How do you think about this: because we are in a document model, updates to fields tend to be pretty easy, but what about updates to an entire column? So basically, when I'm introducing a new embedding model, I fine-tune a new one and I want to re-embed all of my documents. How do you actually handle these bulk or mass updates, especially in light of being in production with load still running against the database?

Marek Galovic: Yeah. So if you want to switch to a new embedding model, one way to do that is to basically create a new column, reinsert the data with that new column, and then switch your queries from the old column to the new column. That's one way to do it.

In general, this is a hard problem, because if you build the database on object storage, the files are immutable, so the only way to really change something about the files, or even a specific column, is to rewrite the data. You pretty much have to reinsert the data yourself.

In the long run, once we actually manage the embeddings for you, which is something we want to do, the interface is really text, not vectors. I think that's a better model: ultimately you care about searching using text; you don't care about vectors, that's just the technology we use. What you care about is the higher-level interface. Once we do that, we will be able to change and evolve the embedding models, evolve all of that underneath in the database, and you don't necessarily have to worry about it.

Nicolay Gerold: Yeah.

I think part of why Elasticsearch is still so popular is actually the way of doing relevance engineering. They have very good feedback functions where you see the impact for a specific query: how do the different terms, how do the different parts of my relevance algorithm, my scoring algorithm, impact this set of documents? Do you provide a similar mechanism for basically debugging or analyzing the relevance and performance of the search queries?

Marek Galovic: In the query language, and I encourage people to check out how the query language looks, you can basically select different parts of your scoring, right? So you can create separate fields: the vector score, BM25, some internal fields, and then do the final expression in the top-k operator. And then you can see what the contribution of the individual parts is in the final scoring, because you get the partial scores and the final score as part of the result. So you can see: okay, for this document the vector similarity is really good, but the BM25 score is pretty shit; for other documents, maybe my BM25 is really good, but the vector similarity is pretty low. And then you can actually fine-tune that and see what the contribution is.

Nicolay Gerold: Yeah, and on the documents side: because you mentioned documents don't have to have the same schema, when I'm running searches with filters, what happens when the field doesn't exist? Does the user have to specify whether to include or exclude it?

Marek Galovic: Yeah, by default we evaluate only on fields that have a value, which is, I guess, the most intuitive way. And you can use an isNull operator if you want to express "hey, this field must match this value, or it can be null"; you can actually express that as an expression.

Nicolay Gerold: Yeah.

What's next for you guys? What are you building at the moment that you can already tease?

Marek Galovic: Yeah, so we have the storage engine out, so you can already play with that and experiment with it. The next part is really adding more of the end-to-end experience. As I mentioned, production search really is document transformation, query understanding, and ranking, and we want to build all of that into the same API. Today, if you want to build semantic search, you have to pick an embedding provider, create embeddings, store them in some database, retrieve them, and send them to a re-ranker. That's unnecessary plumbing and boilerplate: if you look at any guide on how to build semantic search, it's all the same, modulo the last paragraph where you insert your own product. I don't think that's a good experience.

What we really want to do is build this into five lines of code, where you should be able to use text as the interface and not necessarily care about "okay, this is embedded somewhere, this is re-ranked somewhere". So that's the immediate next step.

In the long run, as I mentioned, there's interesting research on how you do hybrid embeddings, how you do tailored, domain-specific ranking, how you adapt rankers from a small amount of feedback. That's at a research stage right now, but it's somewhere we're definitely going to go.

Nicolay Gerold: Yeah.

What would you say is missing in search or
AI that you are actually not building on?

Marek Galovic: Yeah.

I don't know, one thing that's very interesting is integrating search with LLMs at a lower level, right? Today the context is provided as text, but you can imagine injecting context in different ways. Really, attention in LLMs, or in transformers, is doing search over what's stored in the model itself. So you can imagine injecting context at the embedding level, in the latent space, not necessarily in the input space of text. That's a research direction I think is interesting from a purely research point of view, but training LLMs and doing that is not really in scope for us.

Nicolay Gerold: Yeah, and maybe: underhyped, overhyped. In the AI space, what do you think is underhyped and what do you think is overhyped?

Marek Galovic: Overhyped: I think LLMs themselves. In the long run, I think they just become commodities, with different providers converging to the same point. But this is hard to predict, right? Because there can be a breakthrough that unlocks a new capability no one foresaw.

For underhyped stuff, I think verticalized AI: building AI for specific domains and specific use cases. I feel there is a lot of interesting work happening there, like applying AI to biomedical data or to materials science, where training models really unlocks speed of iteration. If you want to go fast, you need to fail a lot, and today, failing in the biomedical domain or in materials science means building the thing, which takes a long time. So having good models there, and being able to use models as experiments and run experiments on them, lets you try a lot of stuff in parallel, try it very quickly, and really scope down to the subspace that matters, and then go and do physical experiments just on that subspace.

It's really nice, and I feel like AlphaFold is the first production-level example in that space, but I think there are many more coming in the coming years. And it's super exciting; it's going to be a huge one for so many things.

Nicolay Gerold: I'm so torn on "LLMs are overhyped". I think they're definitely overhyped, but I think they're still underutilized.

Marek Galovic: Oh yeah, for sure. They're super useful. It's just that making them omnipresent in society, in your workplace, is going to take a bunch of years. There's a lot of...

Nicolay Gerold: I think we are skimming the top off at the moment with LLMs, and I think we will see a lot more stuff at some point, but it will take a lot of time, especially in the enterprise.

Marek Galovic: Yeah.

Nicolay Gerold: Nice.

And where can people learn more about
you, about TopK and what you're building?

Marek Galovic: Our website, topk.io. You can feel free to visit that.

We have a bunch of guides, we have docs.

You can sign up and try it out.

For myself, I'm on Twitter.

I'm on LinkedIn.

I go by my own name, so people can reach out there.

I'm happy to chat.

Nicolay Gerold: So what can we
take away from this that's relevant

for building AI applications?

I think, because it's a little bit different from our usual episodes where we share concrete advice, we went really deep into how they build it and how they design their tools.

So I would take away more general patterns for building data, search, and AI systems.

And one of them is basically separating the write and read paths.

So especially in databases, you often have a trade-off: optimize for writes or for reads, but not really both. And a log-based design really gives you a bit more flexibility, and the option of actually exposing that choice to users: what are they optimizing for? They have basically done this by giving the user different consistency models, so strong, default, and indexed. And through that, the user can choose what they're optimizing for, whether it's reads or writes.

And I think thinking through the trade-offs of your system (and often, when you're in data and AI, it's read versus write), thinking through what knobs you can turn to optimize for one or the other, what trade-off you are going for, or what trade-off you want to allow your user to optimize for, is a really interesting exercise.

And we see this pattern across a lot of different databases and data systems. For example, in Kafka you have events, and they're written sequentially.

Also in CICD, I think you
can draw a parallel that you

commit code to a main branch.

And then from that you build
optimized artifacts for deployment.

Designing for your storage medium is something that is also pretty interesting. S3 and cloud storage aren't just slower disks; they have very different economics. And depending on the cloud you're building on, you have very different economics and pricing structures, and taking that into consideration as you build your application is probably relevant, because in most cases you probably have external constraints which determine what cloud you're building on.

And if not, you should think about how you use your storage medium and then pick the cloud provider accordingly, because mostly storage, egress, and ingress are the major cost drivers. In S3 and the cloud, you have really expensive egress and ingress, and you pay per request, so making a large number of requests is really expensive.

So having something like a log and then writing after certain time periods, basically combining or batching the writes, really pays off, because you can save a lot. Also, Cloudflare, for example, I believe doesn't have any egress and ingress fees, but they charge differently; I'm not sure about Cloudflare specifically, but they probably charge more based on storage. So I think when you design from the get-go for your specific storage medium, you can optimize a lot in terms of costs.

What they also do is combine multiple things into one that usually run through different systems: they combine vector search and text search into a single pass, whereas most search databases see them as multiple different types of search which are then merged and re-ranked. That basically wastes work and misses good results. You don't really want to run separate algorithms and combine the results later; you want to build execution plans that filter, score, and rank in a single pass. I think this is a really interesting pattern as well: when you can combine it, you probably should think about combining it.

Another really interesting thing they have done is strategic caching. Most just use something like Redis; they have built their own solution on EC2, I think. At some point when you're building an application, every database needs a cache in front of it, and since they have built their own logic, they can do a little bit more than a regular cache, because you can build something like two-tier caches: memory for really hot data and local disk for warm data. This can reduce cloud storage costs while also preserving performance. And you can not just cache the raw data, but also cache intermediate results, which are expensive to recompute. By building your own caching solution, which takes a bit more work, you can find a caching strategy that really aligns with your workload: what's frequently accessed, what's really expensive to rebuild, and what data is accessed together.

And I think as we start to build more
and more with AI and especially also

with agents, which access our own
data, this will become more important

because we don't really want to hit
our production database all the time.

Yeah, I think that's it for this week.

We will have one more episode on search next week, and after that we will be moving on to a new season on MLOps.

So I'm excited for next week.

As always, if you liked the episode and you're still listening, subscribe or leave me a comment.

Also, if you have any suggestions in terms
of guests who you would like to have on,

let me know in the comments or send me a
message on LinkedIn, Twitter, blue sky,

otherwise I will catch you next week.

See you soon.
