· 01:14:24
Nicolay Gerold: Developers
treat search as a black box.
They throw everything in a vector database
and hope something good comes out of it.
They also throw all ranking signals into one big ML model and hope it produces something decent. But you don't want to create this witch's cauldron where you throw all your ingredients in and hope for a magic potion. It might work, but when it does, you won't know why. It's difficult to debug and adjust. You won't know, for example, which ranking signal actually delivered the important information. And if something goes wrong, you need to retrain the entire model and you can't pinpoint why it went wrong. You want the exact opposite.
You want layers of tools or techniques
aligned in a graph that you can
tune, debug, and update in isolation.
For ranking, this might
look something like this.
You have a top layer of personalization with user-specific adjustments. So, for example, this only adjusts the result positions of the different search results. In the middle layer, you might have something like signal boosting, where popular items within a specific query or for specific terms are boosted. And this might also happen based on some user behavior data. And the baseline layer might be a generalized ranking algorithm, which is your core relevance scoring based on similarity, but also your TF-IDF scores.
And through this, we basically can
control each layer and can tune it
independently of the others and also
enable and disable specific components
which becomes very interesting when you
actually want to adjust them based on the
user query and pipe them through different
paths through the graph you have.
And it also allows you to debug more easily why something went wrong, because you can pinpoint, okay, where was this result actually boosted up or ranked higher so that it ended up in front of the user. And it also gives you the ability to update each layer without affecting the others, so it makes it easier to A/B test specific components.
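To make the layered setup concrete, here is a minimal sketch in Python of a ranking pipeline with independently tunable layers. All of the layer functions, context fields, and weights are illustrative assumptions, not taken from the episode; the point is only that each layer can be enabled, disabled, debugged, and A/B tested on its own.

```python
from dataclasses import dataclass, field
from typing import Callable

# A "layer" is a function that takes the current scores (doc_id -> score)
# plus some context and returns adjusted scores.
RankingLayer = Callable[[dict, dict], dict]

def baseline_relevance(scores: dict, ctx: dict) -> dict:
    # Core relevance (e.g. BM25/TF-IDF or vector similarity) straight
    # from the engine; passed through unchanged here.
    return scores

def signal_boosting(scores: dict, ctx: dict) -> dict:
    # Boost items that are popular for this query (hypothetical mapping
    # built offline from click/purchase signals).
    boosts = ctx.get("popular_for_query", {})
    return {doc: s + boosts.get(doc, 0.0) for doc, s in scores.items()}

def personalization(scores: dict, ctx: dict) -> dict:
    # Small per-user nudges on top; only adjusts positions, never
    # replaces the underlying relevance score.
    prefs = ctx.get("user_preferences", {})
    return {doc: s + 0.1 * prefs.get(doc, 0.0) for doc, s in scores.items()}

@dataclass
class RankingPipeline:
    layers: list = field(default_factory=list)   # [name, fn, enabled]

    def add(self, name: str, fn: RankingLayer, enabled: bool = True):
        self.layers.append([name, fn, enabled])

    def toggle(self, name: str, enabled: bool):
        for layer in self.layers:
            if layer[0] == name:
                layer[2] = enabled                # switch a layer on/off in isolation

    def rank(self, scores: dict, ctx: dict) -> list:
        for name, fn, enabled in self.layers:
            if enabled:
                scores = fn(scores, ctx)          # each layer is inspectable on its own
        return sorted(scores, key=scores.get, reverse=True)

pipeline = RankingPipeline()
pipeline.add("baseline", baseline_relevance)
pipeline.add("signal_boosting", signal_boosting)
pipeline.add("personalization", personalization)

ctx = {"popular_for_query": {"doc2": 2.0}, "user_preferences": {"doc3": 5.0}}
print(pipeline.rank({"doc1": 1.2, "doc2": 1.0, "doc3": 0.9}, ctx))
```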
And this reflects like a broader
principle in software engineering
where you prefer clear and separated
components over monolithic black boxes.
It's just the same as you wouldn't throw all your application code into one giant function. So you shouldn't throw all your ranking signals into one giant model. And you shouldn't throw all your hopes into a vector database or an embedding model from OpenAI.
And today we are continuing
our series on search.
We are talking to Trey Grainger, the author of AI-Powered Search, and we look at the different techniques behind modern search engines and recommendation systems and how you can combine them. Trey brings a wealth of experience from his different search positions with Solr and Elasticsearch, and also now from his own shop, Searchkernel. Let's do it.
Trey Grainger: I think RAG is a bad acronym. I think GAR is also a bad acronym, only because it goes both ways.
And so I think that it's more of a synthesis: using retrieval to make generative models better, and then using generative models to make retrieval better. So I think it's actually more that retrieval is in the middle. You use generative models to help interpret queries and better understand them. And then once you've gotten the results, you also take those to improve the output of the generative model. So I think it's GAR-RAG or RAG-GAR if we wanted to start to combine them together, but I don't know.
I think it's more important that
people understand the concept and
how to put these things together than
get hung up on particular acronyms.
Nicolay Gerold: Yeah. I always think of the search system in the end rather as a feature pipeline for my generative model, as opposed to just retrieval. How do you see it? Do you think the way I think about it is wrong?
Trey Grainger: Whether it's search or generative AI, I think of all of these systems as nonlinear pipelines. And so for me, I'm just going to talk about retrieval for a second. If I look at a typical information retrieval architecture, usually there's a query intent aspect of it. You take the query intent, you run a search to find results, possibly over multiple requests, and then usually there's some sort of a re-ranking phase. But that's a very overly simplified version of what we're doing.
Because often you will try to interpret query intent, maybe take that, go out and look up some more information to better understand the query intent, and that might include going to your index and finding what's there. And so you refine the query intent, and you request information potentially multiple times.
You get results back, you do re
ranking, and then potentially, if
it's more of an agentic framework you
might actually return results to a
generative model that then determines,
I actually want to explore this further.
And, you continue on, and maybe you
repeat that process multiple times.
Where we are, and where we're going, I think, is not thinking about retrieval as three steps, interpret intent, get results, and re-rank, but thinking of it as a sequence that happens in a larger pipeline, and that pipeline might be linear or it might be nonlinear.
And then I think, to your actual
question, which was, how do you think of
retrieval in the context of generative AI?
It's the same kind of thing.
The reason we use retrieval when we're working with generative AI is because a generative AI model, these LLMs, will take your query, your request, whatever you're asking for. They will then try to interpret it and, without access to up-to-date information, without access to correct information, they will generate a response from their highly compressed understanding of the world.
And so we use retrieval to
augment them with information.
So that first phase, which is the interpretation of the query, is boosted by the second phase, which is the retrieval. But then all of that gets pumped back into the generative model to summarize, so the generative model is used again.
So in a lot of ways, I think of
retrieval as sandwiched in between
the generative AI, where there's
generative AI at the beginning and at
the end, and retrieval in the middle.
But again, you can always go back to retrieval multiple times, you can try to assess and summarize multiple times. It's better to just think of these as two different techniques, almost like you're playing a game of ping pong back and forth, where the generative model says, hey, I need some information, the information comes back, and it just keeps going back and forth until a sufficient answer is determined.
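A rough sketch of that back-and-forth, with the LLM and the search engine as placeholder functions; call_llm and search_index are hypothetical stand-ins for whatever client and engine you use, and the prompt format is invented only to illustrate the loop structure.

```python
# Retrieval sandwiched between generative calls: interpret, retrieve,
# decide whether to retrieve again, then answer.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your LLM client

def search_index(query: str, k: int = 5) -> list[str]:
    raise NotImplementedError  # placeholder for your Solr/Elasticsearch/vector query

def answer(question: str, max_rounds: int = 3) -> str:
    context: list[str] = []
    # Generative model helps interpret the query first.
    query = call_llm(f"Rewrite this as a search query: {question}")
    for _ in range(max_rounds):
        context.extend(search_index(query))
        # The model decides whether the retrieved context is enough,
        # or whether it wants to go back to retrieval with a new query.
        decision = call_llm(
            "Given this context, reply either 'ANSWER: <answer>' "
            f"or 'QUERY: <follow-up search>'.\n\nContext: {context}\n\nQuestion: {question}"
        )
        if decision.startswith("ANSWER:"):
            return decision.removeprefix("ANSWER:").strip()
        query = decision.removeprefix("QUERY:").strip()
    # Fall back to answering with whatever was gathered.
    return call_llm(f"Answer with what you have.\nContext: {context}\nQuestion: {question}")
```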
Nicolay Gerold: And this in the end is really use case dependent, but it's like a graph, which isn't the traditional graph we've seen in data processing, like a DAG, which is acyclic. Here we can branch, we can do really elaborate loops as well and go back, and basically your use case should inform what actions you take. Like in e-commerce, when I have to be really fast, I probably can't do many loops, but in something like report generation, which is a really hot topic at the moment, I can probably do as many loops as I like and focus more on really getting the information right.
Trey Grainger: Yeah, I think in search in particular, one of the key aspects is people expect it to be fast. So a lot of the time, even though you could find a way better answer with ten minutes of processing, a lot of times you have to optimize things to get down to milliseconds or seconds of latency.
So there's definitely a
lot of tradeoffs there.
Nicolay Gerold: Yeah, but I think search is having a real renaissance at the moment because it's now being used in way more contexts than traditional search was. We are going way past website search and web search; we're using it to feed the right information into the LLM so we can actually perform actions or do whatever we want.
Trey Grainger: Yeah, for sure. And I don't know if we've crossed the threshold, we may have already, but go back a couple of years and think of the big web search engines, Google, Bing, Baidu, et cetera. I like to think of the web search engines as basically a giant cache of the internet, right? You could take these generative models and you could have them figure out what you're asking for and then go crawl the entire web to find the information and process it, do the web crawling, all of that, and then return a result. But it might take you 10 years to go through the entire internet to find the right results.
Whereas what the search engine does is
it serves as a cache of knowledge, a
cache of information that the generative
models are basically relying on to
get all of that data in milliseconds.
So the cache makes it very efficient to find the most relevant information to use.
And so that's why search engines
are so well paired with generative
AI for RAG because they provide
that quick access to the data.
So yeah, I think the reason there's this renaissance is because we need the data for these generative models, and we're probably close to, if we haven't already exceeded, the point where the number of requests going to these search engines, which were originally designed for humans, is dominated by generative models. Very soon, if we're not already there, most of the requests are going to be coming from generative models. Bots already make a lot of requests, but I see a future where humans are the vast minority of the requests going to search engines.
Nicolay Gerold: Yeah, and at the moment I see two different camps talking in the RAG space. One is really the AI side, which is coming from LLMs and has discovered RAG and barely even retrieval. And then you have the search side, which is coming with all their methods for actually figuring out retrieval and more old school techniques, which are very well thought through.
What do you think are the misconceptions
of both camps when it comes to AI
powered search in this new generation?
Trey Grainger: Yeah,
that's a great question.
So I come from more of the search camp,
so I'm going to start with that answer.
What I see, so I'm going to just do an analogy to start. I got my start in the information retrieval space migrating to Apache Solr when Solr was very early. And after Apache Solr, Elasticsearch came along and had some things that were better designed for that era, and Elasticsearch started taking market share.
They did something very smart and very interesting, which is, it's not like they were coming and taking away all of the people using Solr. What they did instead is they used logs. They used Logstash and Kibana to make a really useful tool to solve a problem everybody had, which is, hey, I've got a ton of log data and I need to be able to pull it together and analyze it.
By doing that, they actually
opened the door for an entire
new genre of search people.
They brought in all sorts of DevOps people, all sorts of people that really needed to do log search, got in the door, got tons of installations, and then those people, once they had used the tool for a while, started to realize, wait a minute. I can do way more than just log search with this. I can search on all kinds of data. I can build all sorts of products on top of it. And by doing that, the number of people in search and information retrieval just exploded once they developed those skills. I see a very similar thing happening right now with generative AI.
So people start to build their
companies, their start ups.
They're trying to build
generative AI based products.
Once they start, they realize,
Hey, these things hallucinate.
They don't always give
me the right results.
I need to supplement this using RAG.
So they hear the term RAG, they don't exactly know what it is, but what they hear is: oh, I need to plug in a vector database, I need to take my data, I need to encode it with a transformer, I put my embeddings in the vector database, and now my system can magically somehow take the queries, get data from the vector database, and use that to supplement the results. So they do this, they maybe use some off the shelf tool like LangChain, LlamaIndex, something like that.
Once they've done it, they start to
realize, wow, this is like magic.
Somehow, miraculously, my
generative AI now has access to
real, live, up to date information.
And it looks like magic some of the time.
The rest of the time, they get
really weird results because the
data that's coming back from the
vector database for the retrieval
part of RAG isn't the right data.
Out of all the documents,
maybe they got the wrong ones.
And so then they very quickly start to come and look at why that's happening. And what they end up doing is they meander, they start doing some research, and then they open this door, and when they open the door to information retrieval, they start to look inside, and they see all of the folks who come from an information retrieval background waving, saying, hi, welcome to the club.
Here is all of our knowledge about
all these techniques and tools.
And some of them say, wow, that's amazing.
And they dive in.
And some of them say, oh, and
they shut the door and continue on
their way to solve other problems.
So I think that's my sort of little
analogy here, which is I think there
is a Renaissance in information
retrieval and search driven by
the generative AI capabilities.
But I think that very quickly people
are realizing that the R part of RAG is
really the hard part and the part that
takes, years and years to get right if
that part is critical to the system.
And I think the misconception is that,
oh, hey, for RAG I can just, plug
in a vector database and a couple
of libraries and, a day or two later
everything's magically working and
I'm off to solve the next problem.
Because search and information retrieval is one of those problems that you never really solve. You get it good enough and quit, or you find so much value in it that you just continue investing to constantly make it better.
On the flip side, I think there are misconceptions on the search side of things. I think there's a lot of people who have been in search for a while, who know all of the old school techniques: lexical search, stemming, lemmatization, field boosting, using NLP libraries to do parts of speech analysis and boosting. There's all these techniques that sometimes are still useful, but I will definitely say, over the last two to three years, I've probably thrown out about 25 to 30 percent of all the tools and techniques that I used to use, just because they're not relevant anymore.
The other 60 to 70 percent is
still highly relevant and those
things still need to be done.
But there's a lot of things that,
we've just got better tools now.
And I think a lot of people maybe
are a little afraid to embrace those
new tools because they've got the
way they've always done things.
But everybody's different.
Nicolay Gerold: Yeah, what are the parts you have actually thrown away and that you would recommend people throw away from traditional search?
Trey Grainger: It's a good question.
I should preface this by saying
every problem is different.
If I was tasked with, hey, I've got a thousand documents and they're recipes, and I'm trying to put them in a search engine so I can quickly find the ingredients for my recipe, I would say, okay, let's load those into Solr, Elasticsearch, or OpenSearch. Let's do lexical search and let's call it a day. Because that's not a hard problem to solve when you're looking for specific ingredients, and the old school techniques are more than sufficient.
You don't need a vector database,
you don't need embeddings,
you don't need a transformer.
It's overkill.
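For a case like that, a plain lexical query against the engine is often all it takes. A minimal sketch, assuming a local Elasticsearch or OpenSearch node with a hypothetical recipes index and an ingredients field (both names are made up for illustration):

```python
import requests

# Plain lexical (BM25) match against an ingredients field: no embeddings,
# no vector database, just the standard query DSL.
body = {
    "query": {"match": {"ingredients": "smoked paprika"}},
    "size": 10,
}
resp = requests.post("http://localhost:9200/recipes/_search", json=body, timeout=5)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```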
I should caveat this with saying: depending on what you're building, the old school techniques might be fast, cheap, efficient, and even do a better job.
That being said, for your larger use cases, your e-commerce use cases, your more sophisticated enterprise search use cases, across the board I think a lot of the NLP techniques, specifically things that you would use NLP libraries like OpenNLP for, where you're doing parts of speech detection and trying to say I'm going to match on the nouns, or I'm going to match on the verbs, those are basically not as useful anymore.
I think, if you've got the ability
to leverage an LLM, it's gonna do
a way better job at understanding
the entire context and picking out
which words and phrases matter.
So those are just a waste of
time in my view in most cases.
I also think that we're entering a phase now where document interpretation, understanding, and enrichment can be, in many cases, done way better by language models than by some of the old school pipelines that we would have built. Things like OCR, optical character recognition for reading PDFs and documents, are very rapidly being replaced by things like vision language models. So if you use something like ColPali, then you can even take the entire PDF or document, pass it into the language model, have it interpret it and turn it into a vector, and all of those in-between steps of parsing and tokenizing and OCR just go out the window, because when you compare their efficacy to what you get with a vision language model, the vision language model just blows them out of the water.
So you're wasting a lot of time doing
things that get you worse results than
if you just didn't spend the time at all.
So those are some examples I think.
Nicolay Gerold: Yeah, and you mentioned you have to pick the right tools and you have to take a lot from different fields. One we haven't mentioned yet, which is a major topic of your book as well, is recommendation systems. I want to know what you actually took from recommendation systems and brought into your search systems. But first, how do recommendation systems differ from search systems, and where do they overlap?
Trey Grainger: Yeah,
it's a great question.
So I take a slightly different view
than some people, but my perspective
is that there's fundamentally, at least
conceptually, no difference between a
search engine and a recommendation engine.
And what I mean by that is when you
have a search engine, just think of a
traditional search engine like keyword
search, even vector search at this point.
Somebody types in some keywords, a
question, some text maybe some image,
but let's just go with text for now.
They type in some text, and that text
is the context by which they want
to match documents and rank those
documents coming back from the match.
Assuming for now we're just getting documents back, nothing more sophisticated.
The context is keywords and
you get matched results back.
When I have a pure recommendation
engine, let's say it's collaborative
filtering based, meaning it's based
upon the interactions of users
with products and their behavior.
What do I have?
The context that I have is the behaviors
of the user maybe a profile of the
user, it's things I know about the user.
I take that context, I pass it
to a matching engine to match
and rank and return results.
So if you conceptually abstract search engines and recommendation engines up one level, they're both engines that take context, they match results, and they rank the results. And you can say, yeah, you're oversimplifying it, they have different purposes, right? Yes and no.
If I think of it not as two different types of technologies, but as one type of technology, a matching and ranking engine that takes in context, then what I can do is say: what about all the space in between? So if I move from a pure search engine, where the context is only explicitly entered by the user at query time, and I say, yeah, but I've got some user context as well, why don't I use that user context to change the way that I match and rank the results? That's where you get to personalized search.
And so you could think of, for example
let's say I've got a restaurant
search engine, and I'm in New
York, and I search for a hamburger.
I would be pretty upset if my search
engine returned me hamburger stores
in Atlanta and Chicago and London.
That's because I'm in New York, the
search engine should hopefully know
about me and where I am, if nothing
else, based upon my IP address.
And every restaurant search
engine will do that, but that's
an example of personalized search.
It's taking information you know about
the user, in this case it's just a
location from an IP address, and using
that to augment how you adjust the query.
And that's a pretty simple example.
On the flip side, if I go from pure recommendations and I say, let me augment these with context that the user is typing in now. So the user could go in and say, hey, I don't want to include this document that I looked at before. Take that out of my recommendation algorithm. Oh, also, I would like to filter down to these results. Oh, also, let me add some keywords in and make sure I fine tune my recommendations so they target those results.
That's an example of, I would call
it, augmented recommendations.
It's not pure recommendations
just from user context, you're
also taking explicit input.
And so this notion of getting some input explicitly from the user, usually keywords, and getting some information based upon user behavior and what you know about the user, allows you to do very interesting things in the middle. And I would say, and we can talk about this more if you want to dive in, but being able to take user behaviors, user searches, their queries, their clicks, maybe products they've purchased, taking those signals and using those to generate embeddings that represent what they're interested in, and then pairing those with what the user's searching for in terms of keywords and what that means, you actually get to this almost sweet spot in the middle, where you've got as much coming from user understanding as you have coming from their explicit intent.
And then you can marry
those in different ways.
And so what I would say is, to me, search and recommendations are fundamentally the same problem. They're just using different contexts. And what I really like to do is think of the entire spectrum, and depending upon what you're building, target the appropriate balance between the explicit context and the user context.
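One simple way to picture that spectrum in code is to blend an explicit query embedding with a behavioral user-profile embedding and slide a single weight between pure search and pure recommendations. This is only an illustrative sketch under the assumption that both embeddings come from the same vector space; none of these function names come from the episode.

```python
import numpy as np

def user_profile_vector(interacted_item_vecs: list[np.ndarray]) -> np.ndarray:
    # Average of the embeddings of items the user clicked on or purchased.
    profile = np.mean(interacted_item_vecs, axis=0)
    return profile / np.linalg.norm(profile)

def blend_context(query_vec: np.ndarray, user_vec: np.ndarray, alpha: float) -> np.ndarray:
    # alpha = 1.0 -> pure search (explicit intent only)
    # alpha = 0.0 -> pure recommendations (behavioral context only)
    mixed = alpha * query_vec + (1 - alpha) * user_vec
    return mixed / np.linalg.norm(mixed)

def rank(doc_vecs: dict[str, np.ndarray], context_vec: np.ndarray) -> list[str]:
    # Score documents by cosine similarity against the blended context vector.
    scores = {d: float(v @ context_vec / np.linalg.norm(v)) for d, v in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```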
Nicolay Gerold: Yeah, and we will double click soon, but I think you have the perfect precursor or contextualization for the three buckets of relevance in search systems, which you go into in your book. Can you maybe explain the three buckets of relevance and the different search techniques that fall into the different buckets, and also the intersections between them?
Trey Grainger: Sure.
Yeah, happy to.
Yeah, highly related to the last topic.
One of the fundamental questions when someone's approaching search, or when you're doing retrieval augmentation and thinking about the retrieval piece of it, is this concept of relevance. And going back decades and decades, within search and information retrieval, we define relevance as meeting the user's information need. Basically, if you can understand what the user is looking for and return that, then you've succeeded at providing relevant results.
And so the question is,
what is the user intent?
And to me, there are three categories, or I would say three contexts, that are really important for understanding intent. One of those is the content context, one of those is the user context, and one of those is the domain context. And so you can mentally think of this as a Venn diagram with three overlapping circles.
In the content context, if I'm looking
purely at the content and nothing else,
I don't care about the user behavior, I
don't care about the domain, that's where
we've got traditional keyword search.
Type in some keywords, I match
on those keywords, and that's it.
No other context taken into consideration.
Like we just talked about, if I
look purely at the user that's
where I'm in recommendations, right?
It's collaborative recommendations, and
I'm just literally looking at what's this
user's behavior, what items have they
clicked on, what items have other users
clicked on, and it's all behavior based.
And so that's pure recommendations.
And then, if you look at the intersection
of those that's where you get personalized
search, which we just talked about
and all of the things you can do with
combining those two contexts together.
But there's the third category,
which is the domain context.
And so if you overlay the domain context and look at the intersections there: content is, you know, what's in my engine, the behavior is what we talked about, and then the domain context is what you understand about your business, your domain, the topics. Say you have a paint store. It understands colors, it understands relationships of colors, it understands enamel versus, I don't know paint that well, but all the different things you would need to know about paint.
If it's in the pharmaceutical industry,
you would have a very different
understanding of that industry.
And if you look at just the domain
context absent of the content you
have and absent of the user context,
just the domain context, that's where
you would have a knowledge graph.
And if people are building
ontologies, taxonomies, all those
relationships, you can just roll
those up and say, collectively those
go in and make a knowledge graph.
That is a modeling of your domain,
independent of your content,
and independent of your users.
And so then you look at the intersections.
Okay, if I have a knowledge
graph, that's my context.
And I look at the intersection of
that with my content, what do I have?
In that case, it's semantic search.
So when you marry the domain
understanding, all the words,
terminology, phrases, relationships,
with the content you actually
have, you're doing semantic search.
And so there's multiple
techniques for that.
There's just leveraging embeddings with transformers, especially if they've been fine tuned to your domain. There are also lots of semantic search techniques we've been using for a decade or more around sparse lexical types of matching, where you're understanding the meaning of words, you're expanding the query. If someone says barbecue, or top barbecue near Atlanta, then you can understand that near Atlanta is a location, and then you can filter down to a radius and do some query rewriting to understand those things.
That doesn't require a transformer.
It's one way of doing semantic search.
So that's the intersection there.
Zooming back out.
Content understanding, user
understanding, and domain understanding.
And if you intersect content and
user, you've got personalized search.
If you intersect the content
understanding and the domain
understanding, you've got semantic search.
And there's the third intersection, which is less interesting. But if you look at all of them together, the very center where all of those contexts overlap, content, user, and domain, is where you really get at user intent. And to me, anytime you're building a search system, whether it's traditional search or RAG for generative AI, you need to have all three of those contexts in order to effectively get the most relevant results to solve the problem.
Nicolay Gerold: Yep. And do you think all of these three contexts have an impact across the entire retrieval pipeline? There's basically query intent, retrieval, re-ranking, and now also the generation part, which is coming closer and closer. Do you think each one of those should have a place in each of these components?
Trey Grainger: I don't want to say
absolutely always, but in general, yes.
One of the things, just zooming out a little bit, that I find the most challenging about the current era we're in, which has lots of focus on vector databases and lots of focus on transformer-based encoding: most of those encodings are text based, or even if there are images and it's multimodal, you'll get the content context. You might get the domain context if it's general or if you've fine tuned to your data. But almost none of these systems are getting the user context.
Because in order to get the user context
and understand your users, you have to
collect their queries, you have to collect
their clicks, you have to collect what
they purchase and act on, and if you do
it well, you actually need to collect the
things that they don't click on as well.
You need to know what they saw but didn't
click on to know what they looked over.
And so there's a whole realm of building click models to try to train some of these things, which we could talk about.
But the reality is, the reason that people are using these RAG platforms, and I would say almost formulaically these libraries that sort of give you out of the box RAG, is because it's easy. Not that LLMs as a concept are easy, they're very sophisticated, but being able to write a few lines of code, encode my text using some off the shelf model, put it in a vector database, query it, and get results that look somewhat like magic is easy for someone to implement.
Adding in the user context and doing all this collection of user behavior and processing it is a lot of work. Therefore, not only do most people not do it, most people I've talked to who are working on startups in this space don't even know about it. They don't even know to ask the question of how they should be incorporating user behavior into the system.
Nicolay Gerold: Yeah, and I will just rattle off a few techniques. Domain understanding you can use across the board. For example, you can even use graph embeddings for your search, or use them for re-ranking by adjusting the results and looking at the different entities which are used. Content, for example, should be very clear. But I think personalization, or the user bucket, is probably the most difficult, especially because, architecture-wise, it involves a lot of different components, from the client to basically your database. If you actually think about it from a systems perspective, what are the different components which are actually necessary to implement this user understanding or user loop completely?
Trey Grainger: Okay, I'll start with: let's say you were going to go build this tomorrow. What would you need to do? Then I'll get into how you apply it. First thing, as you mentioned, there's a user interface component, right? The user's interacting, they're clicking on things, what have you.
You need to collect that user interaction.
So what does that mean?
Probably some sort of
a JavaScript library.
There's stuff like Snowplow out there
you can use to collect user behavior in
the browser and pass it to the back end.
But what you ultimately want to be collecting is every query the user runs, the results in the order they were returned to the user, whatever the user clicks on, and then any other subsequent user signals like purchases, add to carts if you can get it, things like returns, or posting a bad comment on a review site or what have you. Any signal or behavior you can get from the user about how they liked and interacted with the product as a result of that query is fair game.
So you have to collect those things, know what to collect, know how to format them, know where to save them. That's step one, collect the data. Once you've built that, which is not trivial, but it's doable. And in my book, I've got a whole section on the right format to store it in and how to save it.
Also, the OpenSearch project recently built an add-on, maybe it's going to be part of the core at some point, called User Behavior Insights, which is basically a schema for how to collect this data appropriately. They've open sourced it, and I think they even have implementations for some other engines like Solr. So I think they're trying to create more of a standard for the industry in terms of how to do this.
But regardless of whether you build
something bespoke or try to leverage
something that's more standardized,
you have to collect the data.
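As a rough illustration of what collecting those signals might look like, here is a small sketch; the field names are made up and only loosely inspired by schemas like User Behavior Insights, not the actual spec, and in production the events would go to a message queue or the engine rather than an in-memory list.

```python
import json
import time
import uuid

def log_signal(store: list, query_id: str, signal_type: str, **fields):
    # Append one signal event; the query_id ties later clicks and
    # purchases back to the query that produced them.
    store.append({
        "signal_id": str(uuid.uuid4()),
        "query_id": query_id,
        "type": signal_type,      # "query" | "impression" | "click" | "purchase" ...
        "timestamp": time.time(),
        **fields,
    })

signals: list[dict] = []
qid = str(uuid.uuid4())
log_signal(signals, qid, "query", user_id="u42", keywords="ipad")
log_signal(signals, qid, "impression", doc_ids=["d1", "d2", "d3"])  # what was shown, in order
log_signal(signals, qid, "click", doc_id="d2", position=2)
log_signal(signals, qid, "purchase", doc_id="d2")
print(json.dumps(signals, indent=2))
```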
But then once you've done it,
you have the ability to build all
kinds of models based upon that.
So some of them are personalization based. Based upon a particular user and what they've interacted with, you can look at all the products they interacted with and average those vectors together, and then use that as something that sort of represents that user's interests. And then the next time they run a search, you can boost results up or down a little bit based upon interpreting that vector in the context of the user's vector. That's one way to do it.
Another way is just to generate recommendations and use those to boost results. You can also apply guardrails, where you cluster all of your documents together.
Say if it's e commerce, you cluster
it into maybe a hundred categories.
And then from there, you can personalize
within those categories when they search
for things within those categories.
So as an example, I've got this one in the book, but if somebody goes and searches for a Hello Kitty water bottle or a Hello Kitty plush toy or something like that, and then they search for a black GE electric razor for their face, and then maybe they go search for a stainless steel range or stove or something like that, and then in their next search they search for microwave, that user is probably not looking to see a Hello Kitty microwave. That thing does actually exist in my sample dataset. They're probably not wanting a Hello Kitty microwave. They happen to like Hello Kitty, and they wanted a water bottle or a plush toy, but when they looked for appliances, they were looking for a stainless steel range, which means they probably want to see a stainless steel microwave. Similarly, if they search for a GE razor and they search for a stainless steel Samsung range, they're probably not looking for a GE microwave just because they got a GE razor.
Because there are two very different categories there. And you have to make sure that when you're personalizing, you don't over-personalize.
Because there's no better way to
make your users really angry with
you than to stick them in a bucket
and get them stuck in that bucket,
which is not their actual intent.
So personalization is something
you have to be careful about.
Nicolay Gerold: So basically the clusters would allow me to group the user's behavior and the results that were returned into certain areas. And when a cluster is too divergent from the current query, I can basically decide to exclude the results which are in this cluster. The example being the Hello Kitty microwave.
Trey Grainger: Yeah, and in particular the way I usually build this out, if I'm focused on personalization: say a user runs a search, I generate an embedding, I do my vector search to find results, and I will get a personalization vector for that user that has all of their relevant behavior. But what I'll do is I'll filter that set of items down to only items that are within the cluster of the user's query, or maybe the five or ten nearest neighbors, depending upon how many clusters I have. So I can find it either in the exact cluster or in a very nearby cluster. And then I'll only personalize if I know that the behaviors are within the same category as the kind of query that came in.
If that makes sense.
Nicolay Gerold: So basically you contrast the result sets: you run a regular search, and you run the personalized search, so basically a recommendation system with a personalization vector, and you compare the result sets. And when there is no overlap or a very low Jaccard similarity, you actually tend to disregard the personalization results.
Trey Grainger: Yeah, so for my personalization vector, I'm actually filtering out the behaviors that are not in a cluster that is similar to my query before I even apply the personalization.
Nicolay Gerold: Interesting.
Trey Grainger: So think of it like I'm doing on-the-fly recommendations, and I'm choosing the data points that I'm using for the recommendations based upon the user's query, filtering down to only behaviors that are similar to the user's query or in similar categories as the results of the user's query. So it's what you said, but I'm actually doing the filtering pre-matching and ranking. I'm doing it up front to filter down the behaviors I'm going to use to personalize.
Nicolay Gerold: Yeah, and I think that already leads us into the next question. How do you actually handle conflicting signals between the other buckets as well? So basically, if there is a conflict between the domain and the user, or the domain and the content.
Trey Grainger: Yeah it's
a very good question.
There's a couple of
different approaches to this.
Let me lay some groundwork here
and maybe give a few concepts
and terms before I answer this.
I mentioned signals and collecting
the signals and that you can
use them for personalization.
That's one way to use them.
I tend to think of ranking
algorithms in a couple of buckets.
One of those buckets is what I would
call a popularized relevance algorithm.
So this is something
like signals boosting.
So in the context here, say someone's running a search for iPad and they get a whole bunch of search results back for iPad chargers and iPad dongles, just not great results, because the search engine natively doesn't know what an iPad is and how to boost what's popular.
So what you do with signals boosting
is you take all the queries for iPad.
And then you see which results
people are actually clicking
on, purchasing, favoring.
And then, with enough signals, you can
actually know exactly the right answer.
Not by doing some sophisticated
machine learning algorithm,
but literally by counting.
You say, for this query, what are the top results people are purchasing, that they want? And then you have your top page or two of results that you show to the user. The users are literally telling you, these are the results we want. And then you show those results, and if you do anything more sophisticated than that to try to figure out what you should be showing your user, you're probably going to lose sales if you're in e-commerce.
So that's signals boosting. It's popularized relevance. It's just looking at what the popular queries and popular items for those queries are, and just using the data. So that's one level.
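A sketch of that counting approach, assuming the raw signals have already been joined so each click or purchase is attributed to the query that produced it; the event types and minimum-count threshold are arbitrary choices for illustration.

```python
from collections import Counter, defaultdict

def build_signal_boosts(signals: list[dict], min_count: int = 3) -> dict:
    # signals: events like {"query": "ipad", "doc_id": "ipad-11", "type": "purchase"}
    # Count purchases/clicks per (query, document) and keep the clear winners.
    counts: dict[str, Counter] = defaultdict(Counter)
    for s in signals:
        if s["type"] in ("purchase", "add_to_cart", "click"):
            counts[s["query"]][s["doc_id"]] += 1
    boosts = {}
    for query, docs in counts.items():
        boosts[query] = {d: c for d, c in docs.most_common(20) if c >= min_count}
    return boosts

# At query time the boost table is just looked up and added on top of the
# engine's base relevance score (or used to pin the top page of results).
boosts = build_signal_boosts([
    {"query": "ipad", "doc_id": "ipad-11", "type": "purchase"},
    {"query": "ipad", "doc_id": "ipad-11", "type": "purchase"},
    {"query": "ipad", "doc_id": "ipad-11", "type": "purchase"},
    {"query": "ipad", "doc_id": "ipad-charger", "type": "click"},
])
print(boosts.get("ipad"))   # {'ipad-11': 3}
```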
Another level is personalized relevance. So that's what we were talking about a minute ago. Whether you're generating recommendations and interleaving them into the results, or whether you're actually taking the personalization vector that represents the user and using it to adjust the results. That's where you understand the user and then you cater the results to the specific user.
And then a third category is what
I call generalized relevance.
And so this is your traditional
learning to rank or machine
learning ranking approaches.
Where you're taking lots of queries and lots of results, and you're either automatically learning from the signals a click model that you can use as implicit judgments to train your machine learning model, your learning to rank model, or you're getting explicit user judgments by having people annotate data.
Either way, you build this model that
takes lots of features, it looks at the
judgments of query to document pairs,
and then it learns which features
matter in general across your domain.
So the generalized relevance model
is able to take queries it's never
seen before and to, apply a ranking
algorithm to rank results based
upon what features in documents and
queries normally matter the most.
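As an illustration of that generalized layer, a minimal learning-to-rank sketch using XGBoost's ranker; the features, judgments, and group sizes below are invented placeholders for whatever your click model or annotators actually produce.

```python
import numpy as np
import xgboost as xgb

# Each row is one (query, document) pair; features might be BM25 score,
# vector similarity, document popularity, freshness, etc.
X = np.array([
    [12.3, 0.81, 150, 2],   # query 1, doc A
    [ 8.1, 0.62,  30, 9],   # query 1, doc B
    [ 4.0, 0.35,   5, 30],  # query 1, doc C
    [ 9.9, 0.90,  80, 1],   # query 2, doc D
    [ 9.5, 0.40,  10, 4],   # query 2, doc E
])
y = np.array([3, 1, 0, 2, 0])   # graded judgments (explicit or from a click model)
group = [3, 2]                  # 3 candidate docs for query 1, 2 for query 2

ranker = xgb.XGBRanker(objective="rank:ndcg", n_estimators=50)
ranker.fit(X, y, group=group)

# At query time: compute the same features for the candidates and re-rank.
candidates = np.array([[10.0, 0.7, 60, 3], [6.0, 0.9, 200, 1]])
print(ranker.predict(candidates))
```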
And so I tend to layer these. There's actually one other category which I'll briefly touch on, which is semantic relevance, where you can learn from user behaviors.
For example, somebody typed in the query manger. Usually after someone types in manger, they type another query that is manager. So you learn that there's a misspelling, not from your content, but from actual user behavior.
Every time somebody types in iPad
with a misspelling, you correct it.
So you learn what the synonyms to things
are, what the acronyms for things are,
what the misspellings are, and you can use
that to help interpret queries as well.
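A toy sketch of learning those corrections from query reformulations; the session data and support threshold are made up for illustration, and in practice you would add time windows, normalization, and result-overlap checks.

```python
from collections import Counter, defaultdict

def learn_rewrites(sessions: list[list[str]], min_support: int = 5) -> dict[str, str]:
    # sessions: the queries one user typed, in order.
    # If "manger" is very frequently followed by "manager" (and rarely the
    # reverse), treat it as a learned correction / synonym candidate.
    followed_by: dict[str, Counter] = defaultdict(Counter)
    for queries in sessions:
        for a, b in zip(queries, queries[1:]):
            if a != b:
                followed_by[a][b] += 1
    rewrites = {}
    for original, nexts in followed_by.items():
        candidate, count = nexts.most_common(1)[0]
        if count >= min_support and followed_by[candidate][original] < count:
            rewrites[original] = candidate
    return rewrites

sessions = [["manger", "manager"]] * 6 + [["ipad", "ipad pro"]] * 2
print(learn_rewrites(sessions))    # {'manger': 'manager'}
```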
But if you just look at maybe the first three: if you think of ranking not as a pipeline, but as a sort of hierarchy, at your base level you need some kind of base matching. For lexical search, it can be the BM25 algorithm. For dense vector search, it can just be cosine similarity or what have you, but you need a base-level ranking algorithm. Then on top of that you have this generalized ranking algorithm, like a learning to rank model or ranking classifier. On top of that you can apply the signals boosting, and then on top of that you can apply the personalized search. And some people will take all of these things and throw them in a pot, almost a witch's cauldron, and they just stir it up and they apply XGBoost or some machine learning algorithm and they say whatever comes out is the magic potion. That's my ranking algorithm.
You can do that. But if you keep them separate, there are ways that you can apply them in different ways. Or if we think of it almost like a pipeline, right? You can try something, see if there are results. If not, you try the next thing, and you can put them together in different ways. So I tend to actually do the latter. It's a little bit more hands on, but it gives you fine-grained control over what you're going to apply when. If you want to turn something off or on, you can. You're not stuck with one big overburdened model, because you actually have all these knobs that you can tune.
And so I tend to think of them as a
stack, and a set of techniques that
you pick and choose when you want them.
You can definitely combine them together,
and you would do that in your generalized
model, in your ranking classifier.
And there are other things like cross encoders, which are a form of ranking classifier we can talk about. But there are just all these different options.
And my preference is not to
just throw them in a cauldron
and then get a potion out.
My preference is to intentionally
analyze and use each one where it makes
sense and to layer them appropriately.
Nicolay Gerold: Yeah, and this also gives you the capability to use query understanding in a much more sophisticated way, in that, based on the classifications you're doing, you have different pipes you're running through. For example, for head queries you might not even have to run anything sophisticated. You just put it into a cache and return whatever is in the cache, and that's it.
Trey Grainger: Your top most popular
queries, you should already know the
answer to those before you run them, if
you're not doing user personalization.
But if you want to do user
personalization, then okay, that's going
to change the dynamic and maybe you start
with those and then personalize those.
Also with personalization, I
mentioned earlier, you want
to have kind of a light touch.
If you over personalize,
people get really frustrated.
So usually when I personalize results,
I don't personalize all the results.
I'll take the first page of results, and
I maybe personalize the second result,
and maybe the fourth and fifth result, and
then I leave the rest of the page alone.
Why?
Because I'm hopeful that my algorithm
for understanding user intent is really
good, and they're going to see the
second, fourth, and fifth result, which
are usually above the fold, and be like,
oh, hey, that's the thing I wanted.
This engine seems like it's
really smart, it knows about me.
But if I'm wrong, and the whole page is Hello Kitty microwaves and things I'm not looking for, then the user is going to leave and be frustrated and potentially never come back. So I think for something like personalization, it's useful to keep the algorithm on the side a little bit and use it to interleave results, as opposed to having it take over as a core piece of the overall ranking system, if that makes sense.
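A small sketch of that light-touch interleaving, keeping the organic ranking as the backbone and only swapping a few above-the-fold slots; the slot positions are just the ones mentioned above, and everything else is an illustrative assumption.

```python
def interleave_personalized(organic: list[str], personalized: list[str],
                            slots: tuple[int, ...] = (2, 4, 5)) -> list[str]:
    # Keep the organic ranking as the backbone and only fill a few
    # 1-indexed slots with personalized picks.
    page = list(organic)
    picks = [doc for doc in personalized if doc not in organic[: max(slots)]]
    for slot in slots:
        if not picks:
            break
        page.insert(slot - 1, picks.pop(0))
    # Drop duplicates while keeping first occurrences, then trim to page size.
    seen, deduped = set(), []
    for doc in page:
        if doc not in seen:
            seen.add(doc)
            deduped.append(doc)
    return deduped[: len(organic)]

organic = [f"doc{i}" for i in range(1, 11)]
recs = ["rec1", "rec2", "rec3"]
print(interleave_personalized(organic, recs))
```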
Nicolay Gerold: Yeah, and when you actually think about this, what are the key components, especially in code, that you're setting up to implement this? Because when you don't do this carefully, it's just a jumble of if-else or case-switch conditions which pipe the user request into a certain direction. What's your technological setup to actually keep this structured in an understandable way, one that also allows you to figure out where the mistakes are in your different pipelines?
Trey Grainger: Yeah, so the answer is it depends, and it varies based upon the technology stack people are using and what have you, but I'll tell you my go-to, and a lot of this is in the book as well. I'm going to ignore document understanding for a second, but in terms of query understanding, I tend to take a query when it comes in and have a knowledge graph of known entities, known things in my domain.
So I mentioned a query earlier, it was something like top barbecue in Atlanta. What I'll do is I'll take the query in, and the very first thing I'll do is pass it to an entity extractor, but where the entity extractor has my knowledge graph in it.
So I'll say, hey, do I know
what the word top means?
Oh, actually, yes.
I've got something in my system
called a semantic function.
And a semantic function says the word top means, in the context of a restaurant search, popular restaurants. What does that mean? It means five stars, or the higher the stars, the better.
And so my system will say,
oh, top is a known word in my
domain understanding context.
In my knowledge graph, it's a known word.
That is going to change
how I interpret this query.
And so from that point on, down the
line when this final query goes to
the search engine, I'm not searching
for some embedding vector that roughly
corresponds with some semantic notion
of what the word popularity means.
No, I'm searching for boost my results
by the number of stars because the
user wants popular restaurants, right?
So that's what top means.
When I go to barbecue,
okay, this is a word.
It's not in my knowledge graph.
I don't really know what it means.
So then I'm going to actually go to something like, so I've got something in the book I talk about, which is a semantic knowledge graph, which actually just uses the index to say, hey, what are the other most semantically related words to this? And it brings back things like brisket, pork, ribs, things like that, but it can also bring back a category: hey, the category is barbecue restaurant or Southern restaurant or home cooking or whatever the category might be. And so it can help both boost the category and/or filter on the category, as well as expand the term. So that's a sparse lexical expansion.
It's similar to what you would get with something like SPLADE or similar techniques. In fact, you could use SPLADE for that piece. You could say, hey, I don't know what barbecue is. Let me go ask SPLADE. Then let me expand here into a SPLADE field, if you know what SPLADE is, for the listeners.
So you could do that there. And then when I get to near Atlanta, same thing: near is a semantic function, which means that if the thing after it is a city, then get the lat/long and do a radius search for however many kilometers around the city, say 50 kilometers. And then my query to the engine, with no embeddings whatsoever necessarily, becomes an interpretation of that query that explicitly maps to: a 50 kilometer radius around Atlanta, boost the most popular restaurants, find restaurants that have the word barbecue or are in this category of things related to barbecue, and here are some other related terms.
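A toy sketch of that interpretation step; the knowledge graph entries, the semantic functions, and the output query structure are all invented for illustration, not how any particular engine or Trey's actual system represents them.

```python
# Tiny hand-rolled knowledge graph: known terms map either to a semantic
# function (how the term changes the query) or to related terms/categories.
KNOWLEDGE_GRAPH = {
    "top":  {"function": lambda q: q["boosts"].append("stars desc")},
    "near": {"function": "geo"},          # handled specially: needs the next token
    "atlanta": {"type": "city", "lat_lon": (33.749, -84.388)},
    "barbecue": {"category": "barbecue restaurant",
                 "related": ["bbq", "brisket", "ribs"]},
}

def interpret(query: str) -> dict:
    tokens = query.lower().split()
    parsed = {"terms": [], "filters": [], "boosts": []}
    i = 0
    while i < len(tokens):
        token, entry = tokens[i], KNOWLEDGE_GRAPH.get(tokens[i])
        if entry and callable(entry.get("function")):
            entry["function"](parsed)                       # e.g. "top" -> boost by stars
        elif entry and entry.get("function") == "geo" and i + 1 < len(tokens):
            city = KNOWLEDGE_GRAPH.get(tokens[i + 1], {})
            if city.get("type") == "city":                  # "near atlanta" -> radius filter
                parsed["filters"].append(("geo_radius", city["lat_lon"], "50km"))
                i += 1
        elif entry and "category" in entry:                 # expand known domain terms
            parsed["filters"].append(("category", entry["category"]))
            parsed["terms"].extend([token] + entry["related"])
        else:
            parsed["terms"].append(token)                   # unknown word: keep as-is
        i += 1
    return parsed

print(interpret("top barbecue near Atlanta"))
```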
And that will get you a very tight, data-specific, index-specific answer that almost perfectly expresses the meaning of the user, as opposed to: let me take top barbecue near Atlanta, pass it to a transformer encoder, and it's going to give me some semantic understanding of popular things and things having to do with Atlanta. Which would include the Atlanta Hawks, which would include the location. It's going to understand what you're asking, but it's not going to be able to get the right results from your search engine, unless it's been very explicitly trained on how to query your search engine to get those results, if that makes sense.
And so, I want to say it's a next level above LLMs. It's not. It's just a different approach that tries to focus on getting the user intent. So to answer your question about the tech stack: at any point where I was interpreting barbecue and getting that sparse lexical vector, I could also do a dense vector there. I could take the word barbecue, expand that into a dense vector representation for an embedding, and do that as part of my search. At any point in here I can go back and forth between sparse lexical and dense vector capabilities.
But your question was about the tooling and how you do this stuff, and my answer is: I've got my knowledge graph, I have the ability to interpret the query based upon the knowledge graph, and then whether I use dense vector embeddings, sparse vector embeddings, or some other interpretation technique, they're all just tools in my tool belt that you can apply to the problem at hand.
Nicolay Gerold: Yeah, and I'm really interested, architecturally, in the personalization component. If you retrieve the different user behavior vectors based on a user ID, how do you use those, especially in light of the clustering we mentioned before? You retrieve those, and then you have the query, and how do you filter out which user behavior vectors to actually ignore?
Trey Grainger: Sure. Okay, let's go back to the Hello Kitty microwave example from earlier and walk through it. So a user comes in, they've run a bunch of searches, I forget what I said, like Hello Kitty plush toy or water bottle, a white GE electric razor, a handful of other things, and then they search for a stainless steel GE range. So now they come and they run a search for microwave. So the question is, okay, we could just do content context, we could just run a keyword search and find all the microwaves. Or we could try to understand what a microwave means in the context of my domain and filter down to appliances, so that I'm not seeing microwave cleaning kits, I'm only seeing actual microwaves. I could do that. That's more of a semantic search, right?
Or I can take the personalization context, which is what you asked about. So what I would do in this case: I would do both of the first two, but for the personalization piece of it, I would go in and say, okay, I've got a query for microwave. Let me take that and, assuming I'm doing a dense vector representation here, pass that to my transformer. I'm going to get an encoded version of this as a dense vector, and with that dense vector, I'm now going to figure out, so I've previously clustered all of my products and I've got a hundred clusters, I'm going to now find the nearest neighbor to this vector in vector space, and I'm going to say, oh hey, the nearest neighbor is, I'm going to make this up, kitchen appliances. Great. So I found that, and then I might use that, or I might say, hey, actually find me four more around that. So I'm getting my top five nearest neighbors to the microwave embedding. So I might get kitchen appliances, small kitchen appliances, large household appliances, whatever those clusters are, right? So that will make up the sphere of what I care about. And then to personalize, I go to my service or wherever I'm storing my user behaviors, right?
All their signals. I then go say, get me the list of signals. Maybe it's one, maybe it's a hundred, maybe it's a thousand, whatever the previous user behaviors are. Just filter those down to these clusters, and now maybe we're down to five or six or a smaller number of behaviors.
Those behaviors, which in this case I'm representing as vectors, each of them the vector of the item that was interacted with, I can then take and say: these are the relevant behaviors that we think the user would care to have considered in the context of this personalization, in the context of this one query.
So I take all of those, and they all
have to do with kitchen appliances,
and I say, all right, let me average
these together and get a personalization
vector that represents the user.
It's like a user profile, but it's
contextualized to that one query
that they're running right now.
And then I say, okay, let me take that. Within the context of this vector, I probably have things represented like stainless steel, maybe a price range if they looked at things within a certain price range, maybe a brand like GE if GE is somehow represented in my embedding. And so all of those concepts based upon their past behavior get bubbled up in this personalization embedding.
And then I apply that as part of my query to boost results up or down, or potentially just to run a separate personalization query and interleave, like I talked about. So that would be the general flow for how I would go from a user typing a query now to getting their contextualized behaviors brought back. That would be the flow.
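Pulling that flow together, here is a hedged sketch of building the query-contextualized personalization vector; embed is a placeholder for your encoder, the clustering is assumed to have been computed offline, and the signal format is invented for illustration.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError   # placeholder for your transformer encoder

def nearest_clusters(vec: np.ndarray, centroids: np.ndarray, k: int = 5) -> list[int]:
    # Cosine similarity against pre-computed product-cluster centroids.
    sims = centroids @ vec / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(vec))
    return list(np.argsort(-sims)[:k])

def personalization_vector(query: str,
                           user_signals: list[dict],   # [{"item_vec": ..., "cluster": int}, ...]
                           centroids: np.ndarray):
    query_vec = embed(query)                              # e.g. "microwave"
    allowed = set(nearest_clusters(query_vec, centroids)) # e.g. kitchen-appliance clusters
    relevant = [s["item_vec"] for s in user_signals if s["cluster"] in allowed]
    if not relevant:
        return None                                       # nothing in-category: don't personalize
    profile = np.mean(relevant, axis=0)                   # contextualized user profile
    return profile / np.linalg.norm(profile)

# The resulting vector (stainless steel, GE, a price range, ...) is then used
# to boost results up or down, or to run a side query for interleaving.
```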
Nicolay Gerold: That's so interesting. I want to touch quickly on the different search engines, especially OpenSearch, Elasticsearch, Solr, and Vespa. What are the scenarios where you really would say you would pick one over the other? I'm not interested in exact trade-offs, but rather, what are scenarios where one really shines?
Trey Grainger: Yeah, so I'll preface this by saying I like them all and try to be generally vendor neutral. I do search consulting and am happy to work with anybody, work on any of them. But it's interesting. I'm going to be a little PC on this answer, so sorry. So I told you I got started with Solr. I love Solr. Solr's near and dear to my heart.
It's an Apache Software
Foundation project.
It's great.
It doesn't have the commercial backing that the rest of them do, and so the development and upkeep, I would say, is slower than I would like.
And so I definitely see some of
the other engines surpassing it.
And it's the default in my book, the book's got Solr, OpenSearch, and others that you can use with it, but it's near and dear to my heart. But depending upon what you're looking for, there might be other technologies that either are faster or have some capabilities, things like that. So Elasticsearch came about to replace Solr; they took a lot of market share, and there's a lot of really good technology there. They had issues with licensing where they went non-open-source for a while, and now they're technically open source. The license is not my favorite license. But as a result of that, Amazon forked Elasticsearch and has OpenSearch.
At this point, I would say not enough time has gone by that I would draw a clear differentiating line between Elasticsearch and OpenSearch.
OpenSearch is doing some new things,
Elasticsearch is doing some new
things and both are really cool.
So I think both are great projects.
I like the governance model of OpenSearch better. The OpenSearch Foundation is part of the Linux Foundation. I just love open source. I just feel confident about it. But Elasticsearch has great technology. They've got great engineers, really smart people, and they do good work. And so if you're in their ecosystem, in their sphere, I think you can't go wrong with any of them.
Vespa is an interesting one, because
they've been around for probably
the longest of any of them; they
were the Yahoo search engine.
They were internal to Yahoo, then
there was the Verizon Media split and
all that kind of stuff, then it became
an open source project, then they
became an open source company, and then
they raised a bunch of VC capital.
When I look at Vespa, what I generally
see, in terms of my work with it and
in terms of implementation, is it's
like everything and the kitchen sink.
It's very powerful.
It was written by people who are
very smart and think about search
in a very, I don't want to say
correct way, but it's very clear
that they're somewhat visionary in
terms of how they've designed it.
But it's also intimidating when people
go to try to get started with Vespa;
from day one you're talking in the
terminology of vectors and dimensions,
and there's a learning curve to get over.
And so what I tend to see is people
who hit a limit with some of the
other engines will go to Vespa, because
they're like, hey, I need all this
power and it's worth the learning curve.
And people who are approaching search,
maybe without as much sophistication,
might just be scared of it because
it's a lot.
And that's not to say that Vespa is the
best at all things because it's not.
It's just to say that it's
very capable and it's getting
more approachable over time.
But I think, at least in, in my
circles I think Vespa is very powerful.
And to the point where it
intimidates some people.
And I think that'll change
over time, I'm sure.
But I hopefully I'm not saying
anything that anybody would
disagree with or be offended by.
But they all have their place
and they're all really good.
And they also all, I should
mention, leverage each other
as competition to get better.
Vespa is focused on getting simpler.
Elasticsearch, I'm going to say, was
forced back to open source by the
OpenSearch pressure; maybe they would
claim there are other reasons.
And Solr, when Elastic came
about, changed a lot of its
architecture to meet the demand.
So I think it's really good
to have a lot of options.
It's really good to have competition.
And I think everyone can see,
their strengths and their
weaknesses and try to minimize
weaknesses and focus on strengths.
But they're all great.
I recommend all of them for
a project, depending on the
project and what your needs are.
No bias towards any one in particular.
Sorry.
Nicolay Gerold: And what are the
most exciting or promising things
on the horizon for AI-powered search?
Trey Grainger: The most
promising things on the horizon.
Ah, man there's so many things.
I write about a lot of them in the book,
and I think my message to people,
whether you're getting into search
afresh because of generative AI or
you've been here for a while and you're
trying to pick up more of the AI
aspects, is that there's a wide spectrum
of tools and techniques and approaches,
many of which we've talked about today.
And it's really useful to familiarize
yourself with them, try to apply them
where it makes sense, and just know
what's possible and then how to do it.
So the starting point is there's
a lot of unexplored territory
that most people just need to explore.
But beyond that, in terms of
the future I'm really excited
about a couple of things.
One is these late interaction
models like ColBERT and ColPali.
Where it's funny, we've gone from a
bag of words and a bunch of keywords,
where I'm matching on each keyword and
then adding up how well each of my
keywords matched, and we shifted the
pendulum to, hey, let's just take the
whole document and turn it into one
embedding.
That represents the meaning
of the entire document.
And then we decided, oh wait,
actually we're losing too
much nuance from the document.
So let's shift back now, and
let's start chunking documents.
This sentence gets an embedding,
and this sentence gets an
embedding, and this sentence.
And then we're like now each sentence
doesn't have the context of the
whole document or around it, so now
let's contextualize the sentences.
Overlapping chunks, and taking chunks
and doing contextualized embeddings,
where I take an embedding of the document
and an embedding of the sentence and
combine them together to have the
sentence in the context of the document.
We're doing all these weird amalgamations
of trying to split the document apart
into some kind of semantic chunks.
And it works, but it's a
lot of pain, and it's a lot.
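One simple way to sketch that contextualized-chunk idea, under the assumption that a weighted average of the chunk and document embeddings is an acceptable way to "combine them together" (other fusion methods exist):

    import numpy as np

    def contextualized_chunk_embedding(chunk_emb, doc_emb, alpha=0.7):
        # Blend a sentence/chunk embedding with its parent document embedding so
        # the chunk keeps some surrounding context; alpha weights the chunk itself.
        combined = alpha * chunk_emb + (1.0 - alpha) * doc_emb
        return combined / np.linalg.norm(combined)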
What I like about something like ColBERT,
and this late interaction model, is
it goes back almost to the original
idea of matching on individual keywords.
But it does it with embeddings.
So it goes through a document and
says: every keyword in this document,
I'm going to create an embedding for,
similar to what we used to do
with Word2Vec back in the day.
Every word has an embedding, but
unlike Word2Vec, which always assigns
the same meaning to every word and
isn't contextualized by surrounding
words, with ColBERT we give each
word a contextualized meaning
based upon the context surrounding
it in the rest of the document.
So each word gets its own contextualized
embedding based upon its particular
meaning and interpretation there.
We do the same thing with the query.
So the query comes in, we take
each individual query word and
we generate embeddings.
But what we can then do is take all
of the embeddings, all the words in the
query and all the words in the document,
represent them as embeddings, and
then compare the similarity of
each of those embeddings as we go.
And then, for each word in the query,
we take the maximum score of any word
in the document, and then we can
add those together.
It's not complicated at all, but
it does take a lot of extra
storage to store that many embeddings.
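A toy version of that MaxSim scoring, assuming you already have per-token embeddings for the query and the document (in practice a ColBERT-style model produces these):

    import numpy as np

    def maxsim_score(query_token_embs, doc_token_embs):
        # Late-interaction scoring: for each query token, take its best cosine
        # similarity against all document tokens, then sum those maxima.
        q = query_token_embs / np.linalg.norm(query_token_embs, axis=1, keepdims=True)
        d = doc_token_embs / np.linalg.norm(doc_token_embs, axis=1, keepdims=True)
        sim = q @ d.T                       # (num_query_tokens, num_doc_tokens)
        return float(sim.max(axis=1).sum())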
But what it does is it allows us to
go back to: I'm matching on keywords
again. I'm matching and ranking on
keywords, but I'm interpreting the
meaning of those keywords based upon
the semantic context around them, as
opposed to just asking is this string
there or is it not, which is what we
did with traditional Boolean matching
and BM25.
And those models are vastly
outperforming these other
approaches that we're trying.
And when you look at ColPali
versus all the OCR stuff that we
talked about before, it's massively
outperforming what was there before.
So I think, not that we're getting
back to the roots, because this is a new
technique, but this notion of thinking
of words as contexts of meaning,
like clusters of meaning that have some
context, and matching on those clusters
of meaning and how they combine,
I think is core to what it
means to interpret language.
And I like that these new approaches
are, I think, more appropriately
focused on interpreting language as
opposed to just using an embedding
model and expecting magic to happen.
Nicolay Gerold: Yeah, and if people
want to start building this stuff
we talked about, what are, in your
opinion the best resources out there?
And I'm already gonna say your book
is probably the best one I've found so
far, so I'm gonna throw that in there.
So maybe shout out, like,
where people can find that?
Trey Grainger: Yeah.
For the book, Manning is the publisher.
So if you go to AIPoweredSearch.com
you can buy the book.
It hasn't launched yet, because the
book's literally just gone to the
presses, but in probably a couple weeks
you'll have it in your hand if you've
bought it.
But we also have a website
launching for it, and part of that
website is actually a community.
One thing I'm trying to do is make
the book not just a book, but also
a resource, with a website and a
community where people can join and
actually have discussions, talk,
interact, and discuss the latest and
greatest at the forefront of
AI-powered search.
And so we've got maybe 400 members
who've joined already, mostly search
experts and people from conferences
and things like that.
But in terms of resources:
the book is a key one, and the website,
but I'm trying to create a larger
community around AI-powered search and
the concepts in the book, and really
promote the companies and the people
in the space.
I'm hoping that will be one
of the best resources in our
field for this going forward.
But otherwise, yeah, I'll shout out
to all the companies in the space
who are doing massive evangelism.
All of them do it: OpenSearch,
Weaviate, Vespa, everybody.
I would say that if I look at the
quality of the content coming out,
or at least the amount of really smart
people who are trying to explain this
to people, companies like Weaviate do
a really good job with their evangelism.
Vespa, I think you had Joe on;
he's been on the podcast recently.
He's super bright and does a
great job of evangelism as well.
There's so many people
from all these companies.
I shouldn't probably be calling out
specific names because there's so many.
But I hope that with the book and
with the website, we can actually pull
a lot of these people in and try to
like cross post and share a lot of this
content to just really get it out there.
Because there are a lot of people
who are trying to get into search,
get into retrieval, and try to understand
it, and the resources are out there, but
they're definitely spread across, you
know, everywhere: different companies'
blogs and YouTube channels.
And I think that having a resource
where we can pull it together and
share will be really helpful.
Nicolay Gerold: Nice, and I will
put all of that in the show notes.
If people want to follow along with you
personally where would you point them?
Trey Grainger: Twitter, LinkedIn,
Trey Grainger.
I've got a company called Search Kernel.
I'm going to be hanging out in the
AI Powered Search community, so if you
go to community.aipoweredsearch.com,
all one word, you can pretty much catch
me there any time if you want to ask
questions or hang out.
That's where I would go.
Nicolay Gerold: So what can we take away
when we are building search applications?
First of all, RAG is not
just vector search.
You want to build RAG as a
continuous feedback loop.
Generative AI helps to interpret
queries, summarize results, and
also add, for example, metadata
filters to improve the search results.
The retrieval provides up-to-date
and accurate information
to generate better results.
So each component enhances the other,
and we can even loop: we can do
agentic RAG, where we ask the LLM whether
we have enough results and basically
run a new retrieval with a newly
generated query or just an adjusted query.
And through that, we have way more
capabilities than this vector-search-only
component, and then we can basically
layer additional search techniques
on top, like, for example, a traditional
lexical search, or more personalization
through personalization vectors and
stuff like that. So we basically build
a pipeline. First, query understanding:
you interpret the query, you classify
the intent, you extract entities,
you extract filters.
For example, type filters
are a very common one.
Then you retrieve: you use
multiple retrieval strategies,
keyword search and vector search,
and then you basically combine
them, you fuse them, and re-rank.
Then you add domain-specific filters;
here you could place your knowledge
graph. And then you enhance the
results with an LLM: you can summarize
them, gather additional context,
and generate follow-up questions.
And lastly, you have the iterative
refinement, like a loop.
If you don't have an adequate answer
to the user query, or you don't have
enough results or enough good results
to actually answer the query,
you reformulate it and run an
additional retrieval; or, if it's
enough, you generate a final response.
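A rough sketch of that loop; the helper functions (understand_query, keyword_search, vector_search, fuse_and_rerank, llm_judge_sufficient, llm_reformulate, llm_generate_answer) are hypothetical placeholders for whatever engine and LLM you use:

    def answer(query, max_iterations=3):
        parsed = understand_query(query)              # intent, entities, filters
        results = []
        for _ in range(max_iterations):
            results = fuse_and_rerank(
                keyword_search(parsed),               # lexical retrieval, e.g. BM25
                vector_search(parsed),                # embedding retrieval
            )
            if llm_judge_sufficient(query, results):  # enough good results?
                break
            parsed = llm_reformulate(query, results)  # adjust or regenerate the query
        return llm_generate_answer(query, results)    # summarize, add context, follow-ups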
And I think this is a really interesting
architecture pattern because it creates
components, it creates a separation of
concerns, and you can also decide, okay,
what is actually necessary?
I think Trey showed the different
architectural components and how you
could combine them, and what's most
likely interesting for you is picking
a few which you think bring the biggest
bang for your buck, which are actually
most suited for your use case,
and just implementing those.
For example, in e-commerce, you
might not want to use this
iterative refinement or really
expensive result enhancement.
Also, you won't use it because
it will add too much latency.
So you take the components that
actually suit your use case.
And you build a pipeline like that.
And I think like having an overview
of all the different things you
could do and for what they are
well suited is really interesting.
And this really leads into his
framework of three critical contexts,
which helps you actually decide:
in the type of search system I'm
building, what is actually relevant?
Because his three contexts, to
summarize again, or paraphrase:
you have the content, you have
the user, and you have the domain.
The content basically is
the raw document processing.
The user context is the behavioral
patterns and preferences.
The domain context is like the
business and domain specific knowledge.
And for the business- and domain-specific
knowledge, there might be a mismatch:
what is business- and domain-specific
for your user versus for your company?
And then you basically have the
implementation of these different
contexts: for content, vector
embeddings, keyword indexes, document
structure understanding; for user,
click tracking, query logging,
purchase and interaction history,
session behavior, user clusters; for
domain context, knowledge graphs,
entity relations, business rules,
taxonomies, semantic functions.
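If it helps to keep those three contexts explicit, a plain mapping like this (names are illustrative, not a prescribed schema) is enough to make the separation visible in code:

    SEARCH_CONTEXTS = {
        "content": ["vector_embeddings", "keyword_index", "document_structure"],
        "user": ["click_tracking", "query_logs", "purchase_history",
                 "session_behavior", "user_clusters"],
        "domain": ["knowledge_graph", "entity_relations", "business_rules",
                   "taxonomies", "semantic_functions"],
    }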
And you basically have these different
technologies as a tool set, and then
you look at your search application
and ask, okay, what's going wrong?
My users in the front end are
not really clicking the results,
but where is that coming from?
Am I serving inadequate results?
Is the content actually bad?
Then I might have to fix my content.
Is it not personalized enough?
Then I have to improve the user context.
You basically treat it as a grab bag of
different techniques that you can layer
with each other to find a solution.
And in most cases, it won't be like one
or the other, but a combination of those.
So you need to combine them in the end
to build like a complete search system.
But you always fix the biggest issue
first, where it's burning the most.
Is the content garbage, is the
search garbage, or is the
personalization garbage?
Pick it, fix it, go to the next.
And one of the things he mentioned he
often sees when people do this is the
witch's cauldron anti-pattern, which is
that people use one monolithic ranking
model. So you take whatever signal you
find, like BM25, vector similarity,
popular items, click-through rates,
conversion signals, user preferences,
recent behavior of the users, you take
all of it and slap it into one massive
model and basically hope something
good comes out of it.
And it's all well and good if
it works, congrats.
But if it doesn't work, you will
have a lot of trouble, because
you can't debug it, and you will need
to retrain the entire model from
scratch to actually try to fix it.
So your iteration loops are way longer.
So instead, you shouldn't throw
everything together into one pot, but
rather separate the different components,
find smaller fixes, and componentize it.
This is also very
software-engineering-esque: you want
to separate the different components
and isolate them so you can easily
trace bugs. You can tune them in
isolation, you can easily A/B test
adjustments, and you also make it
easier to maintain, because you
can change a single component.
And he really advocated this layered
architecture with different signals
which can be turned off and on, which is
something I found really interesting.
So you basically have the base layer,
which is always on, like BM25 and
vector similarity search, which does
the basic relevance scoring.
These are your core matching criteria.
Then you have the middle layer,
which is more like signal boosting:
popular items, click-through rates,
conversion signals, historical
performance as well.
And then you have the top layer,
which is more like the personalization
part: user preferences, recent behavior,
category affinities, adjusting the
result positions.
And you keep these independent and
can then switch them on and off,
also based on the query type,
and use them together.
Isolate them and you can really see,
okay, where are my results coming from?
Say in the result set I delivered to
my users, someone clicked on one of the
bottom results; it came out of the base
layer, but my personalization layer
or my signal boosting layer basically
downranked it.
So you can see, okay, where might
my issue be coming from?
And then you can tune the component
which actually was in the wrong.
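A minimal sketch of that layered, toggleable ranking; the layer names, weights, and score functions are illustrative assumptions, not Trey's exact design:

    class RankingLayer:
        def __init__(self, name, score_fn, weight=1.0, enabled=True):
            self.name, self.score_fn = name, score_fn
            self.weight, self.enabled = weight, enabled

        def score(self, query, doc, user):
            # A disabled layer contributes nothing, so layers can be switched
            # on and off per query type without touching the others.
            return self.weight * self.score_fn(query, doc, user) if self.enabled else 0.0

    def rank(query, docs, user, layers):
        # Keep each layer's contribution so you can trace why a result moved.
        scored = []
        for doc in docs:
            contributions = {layer.name: layer.score(query, doc, user) for layer in layers}
            scored.append((doc, sum(contributions.values()), contributions))
        return sorted(scored, key=lambda item: item[1], reverse=True)

    # layers = [
    #     RankingLayer("base_relevance", bm25_plus_vector),          # always on
    #     RankingLayer("signal_boost", popularity_and_ctr, 0.5),     # middle layer
    #     RankingLayer("personalization", profile_similarity, 0.3),  # top layer
    # ]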
And one more part of that is
basically his smart personalization.
The technique he mentioned is basically
user behavior clustering, where you
create different category clusters and
map user interactions to the clusters.
I think this was a really
interesting approach.
I think I need to try to code it up to
really understand it, but this could be
something really interesting
and something pretty easy to implement.
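Here's one way it could look, a minimal sketch assuming you have item embeddings and a log of which items each user interacted with; KMeans just stands in for whatever clustering you prefer:

    import numpy as np
    from sklearn.cluster import KMeans

    def build_item_clusters(item_embeddings, n_clusters=50):
        # Cluster item embeddings into rough "category" clusters.
        return KMeans(n_clusters=n_clusters, n_init=10).fit(item_embeddings)

    def user_cluster_affinities(clusters, interacted_item_embeddings):
        # Map the user's interacted items onto the clusters and count how often
        # each cluster was hit; the result is a per-user affinity profile that
        # can drive personalization boosts.
        labels = clusters.predict(interacted_item_embeddings)
        counts = np.bincount(labels, minlength=clusters.n_clusters)
        return counts / counts.sum()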
And maybe to note last, I think we
already went into that a bunch:
where modern search systems are evolving.
We are really getting into a more
contextual understanding, with, for
example, ColPali, but also ColBERT,
where we have word-level or
image-patch-level understanding,
and we have embeddings for each.
We can maintain word relationships,
and we can also trace back,
okay, where might a match be coming
from and why is it relevant?
And this is much more compute intensive,
but also leads to way better results.
And I think we will see more and more of
that, especially as the technologies are
getting better to actually deploy them.
And this could also mean that we
might get away with worse retrieval
systems, but with more powerful
re-ranking systems.
I think this really fits into the
general theme of generative AI, where
we have more inference-time compute,
and so we over-search: we retrieve
a lot of documents and then we run them
through a re-ranking system.
And when it's efficient and fast enough,
we might actually get very decent
results, or way better results than we
do currently when we are just using
semantic search.
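In code, the over-search idea is just a two-stage pipeline; retrieve() and rerank_score() here are hypothetical placeholders for a cheap first-stage retriever and a more expensive re-ranking model:

    def over_search(query, k_retrieve=500, k_return=20):
        # Cheap first stage: retrieve generously (lexical and/or vector search).
        candidates = retrieve(query, limit=k_retrieve)
        # Expensive second stage: spend inference-time compute on re-ranking.
        rescored = [(doc, rerank_score(query, doc)) for doc in candidates]
        rescored.sort(key=lambda item: item[1], reverse=True)
        return [doc for doc, _ in rescored[:k_return]]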
Yeah, that's it.
I will stop talking now; I think this
is already the longest episode we've had.
I really can only endorse
buying the book from Trey.
I think I've read it twice now,
once when I bought it and
once when I got Trey on as a guest.
The final version came out a few
weeks ago, so get it; I think
it's available everywhere.
Also join Trey's community
for AI-powered search.
I'm also in there, if you want to chat.
Otherwise, we will be continuing
our series on search next week.
And as always: like and subscribe,
it helps a lot.
Leave a comment, especially
on the podcasting platforms.
Otherwise, I will see you
next week. See you soon.