· 01:33:44
Nicolay Gerold: Time is probably one of
the most ignored dimensions, especially
within RAG and knowledge graphs.
But time shapes data
validity and also context.
Every data point exists in a
temporal context that affects its
reliability but also its usefulness.
A common pitfall: in cybersecurity,
for example, if we use week-old
threat intelligence, we can miss
new zero-day exploits.
In legal, if teams reference
an outdated contract or outdated
regulations and laws, we
risk really costly disputes.
Engineers who work with stale API
documentation introduce bugs into their
systems, and API contracts break.
So we really have to
timestamp all data operations:
ingestions, updates, observations.
At the least, this is usually
three different columns,
created_at, updated_at, deleted_at,
for the different rows within our tables,
and if we can, we should even try to
implement something like versioning where
we actually keep the old data as well.
So instead of updating the existing
record, we create a new one with the
updated data and mark the old one
as deleted, but keep it in there.
So we have this lineage.
And across all of this, we really
shouldn't treat data as perpetually valid.
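A minimal sketch of that timestamp-plus-versioning pattern, in illustrative Python; the class and field names are invented for the example, not an implementation from the episode:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Record:
    """One immutable version of a row: created_at/updated_at/deleted_at plus a version number."""
    key: str
    data: dict
    version: int = 1
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    updated_at: Optional[datetime] = None
    deleted_at: Optional[datetime] = None   # soft delete: the old version stays for lineage

class VersionedStore:
    def __init__(self):
        self._rows: dict[str, list[Record]] = {}

    def upsert(self, key: str, data: dict) -> Record:
        now = datetime.now(timezone.utc)
        history = self._rows.setdefault(key, [])
        if history:
            history[-1].deleted_at = now          # mark the previous version as superseded
        new = Record(key=key, data=data,
                     version=len(history) + 1,
                     created_at=now, updated_at=now)
        history.append(new)                       # keep every version: that's the lineage
        return new

    def current(self, key: str) -> Optional[Record]:
        history = self._rows.get(key, [])
        return next((r for r in reversed(history) if r.deleted_at is None), None)

    def lineage(self, key: str) -> list[Record]:
        return list(self._rows.get(key, []))
```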
So in today's conversation, we'll be
talking to Daniel Davis, who is working
a lot on what he calls temporal RAG:
basically, how we can bring a time
dimension to retrieval-augmented
generation and knowledge graphs.
And he puts data into
three different buckets.
And each of those requires
different handling.
The first bucket is basically
observations, which are like
measurable and verifiable recordings.
So, for example, I'm wearing
a hat which reads Sunday Running
Club, and Daniel notices, okay,
the hat has text on it which
reads Sunday Running Club.
This is an observation, but it
requires supporting evidence.
The second part, assertions, is
more of a subjective interpretation.
So, for example, he could say
like, the hat is greenish.
which might be hard to verify if
you're only listening to this.
So you would basically have to
trust his assertion; assertions
always require confidence levels and
usually also update protocols.
Facts are immutable and
verified information.
And the immutable part
makes facts very rare.
And facts especially need
a preservation protocol,
so that the fact can persist
through time and isn't updated.
Otherwise, it would fall
into the other two buckets.
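One way to make the three buckets concrete in code. This is an illustrative sketch only; the field names and structure are assumptions, not how Daniel or TrustGraph actually model them:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Observation:
    """Measurable recording; must point at supporting evidence."""
    statement: str
    evidence: list[str]                      # e.g. a transcript timestamp, a sensor reading
    observed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class Assertion:
    """Subjective interpretation; carries a confidence and can be revised."""
    statement: str
    source: str
    confidence: float                        # 0.0 - 1.0
    asserted_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    superseded_by: "Assertion | None" = None # update protocol: link to the revision

@dataclass(frozen=True)
class Fact:
    """Immutable and verified; frozen so it cannot be updated, only preserved."""
    statement: str
    verified_against: tuple[str, ...]        # the observations/evidence that back it
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```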
And these differentiations, which
we will go deeper into, basically
allow you to distinguish
between static and dynamic data.
And you will realize that
most of your data is dynamic.
So you should treat it as such.
So you should add the time dimension,
monitor data freshness
systematically to see whether data
has become stale, and also review
and update your data regularly.
And when you get into the really advanced
space, what is really interesting
to me at the moment is, for example,
how to handle the dependencies between
two different parts of the system,
for example the code and the documentation,
which both have different lineages
through time as they are updated.
They also refer to each other, because
if the code base changes but my
documentation doesn't, I have a
break in the dependency, and I can
only notice that if I take time
into account as an extra consideration.
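A toy version of that dependency check: each artifact carries its own last-updated timestamp, the dependency is an explicit edge, and a mismatch in their lineages flags the break. All names and the 30-day threshold are made up for illustration:

```python
from datetime import datetime, timezone, timedelta

# last-updated timestamps per artifact (in practice these come from git, a CMS, etc.)
last_updated = {
    "payments-service/code": datetime(2025, 1, 20, tzinfo=timezone.utc),
    "payments-service/docs": datetime(2024, 6, 3, tzinfo=timezone.utc),
}

# explicit dependency edges: the docs depend on the code, so they should not lag far behind it
depends_on = [("payments-service/docs", "payments-service/code")]

def stale_dependencies(max_lag: timedelta = timedelta(days=30)):
    """Yield (dependent, dependency, lag) where the dependency changed but the dependent did not follow."""
    for dependent, dependency in depends_on:
        lag = last_updated[dependency] - last_updated[dependent]
        if lag > max_lag:
            yield dependent, dependency, lag

for dependent, dependency, lag in stale_dependencies():
    print(f"{dependent} is {lag.days} days behind {dependency} - likely outdated")
```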
And both of these principles, both
of these frameworks, basically
work together.
So good data isn't just about storage,
it's also about maintaining
integrity and usefulness over time.
And today we will be looking at
how we can actually embrace
time as a core data dimension,
how we can differentiate between
observations, assertions, and facts,
but also how Daniel and the team at
TrustGraph build their knowledge graph,
and how they basically favor simplicity
and modularity over really
elaborate taxonomies and ontologies.
Let's do it.
Daniel Davis: Time doesn't really
factor into a lot of the ways that
things are done right now in terms
of the architectures that we see in
AI frameworks, platforms, whatever
it is that people are calling them.
There's a lot of opportunity for time.
And also just in information itself,
time is one of those factors
that often gets overlooked.
I've seen that so much
in my career, though.
I spent a lot of time in the risk world
doing, you know, risk assessments for
really complex systems, and I've argued
for years that time is a really important
factor of a risk assessment, because
risk changes over time. Once you
do that assessment, you've kind of planted
a flag in the ground and said, this
is as of today. Six months from
now, ten months from now, two years from
now, this is probably not true anymore.
And there's not a single risk
framework that I'm aware of that
actually takes time into account.
There's a lot of work like that,
in fact, that just doesn't
take time into account.
And that's something that we at
TrustGraph are really looking
to do in the near future.
Nicolay Gerold: How do you
actually want to integrate it?
Maybe let's start first with the issue
that time really isn't considered
in knowledge graphs. Where does that
come from, and why isn't putting time
as an attribute or as a type of
relationship enough?
Daniel Davis: It's a pretty difficult
problem to solve. Think of a really
large document, a 200 page PDF,
because I have one I've been working
with a lot lately for an article I'm
actually writing, to do some comparisons.
It doesn't really have much time data in
it except maybe on the very first page.
You have a date in which the document
was issued, but once you get into
page 364, and the topics on that page,
there's no time information anymore.
So you don't get any information on how
fresh this knowledge, or this data,
or this information, this observation,
this assertion is. You don't get that data.
So the only way to really handle that
in a lot of cases right now is manually,
with document metadata. As you're doing
an ingest, you can preview metadata:
is there a date somewhere in the document?
And at the very least, I can say, we
ingested this document at this point.
And so you have that point in time.
So I think, when you look back at
historical information, there
aren't really good technological
methods and solutions to extract
information that might not be there.
But that's why you want to get this
information into a system going forward
that is able to manage that time.
Because once you get the data into the
system, you maybe have to plant that flag
in the ground and say, okay, we don't
have a lot of temporal data for what
came before, but now we do. So now we can
begin to say, it's fresh as of now,
and what happens going forward?
Are we seeing new information?
Are we seeing updates?
Are we seeing information that conflicts
with what we already have? And then
you're able to start to calculate:
how fresh is this data?
Is the data stale?
Is the data static?
Does it never change,
or does it change slightly?
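A small sketch of the kind of freshness calculation he describes, assuming only that you have recorded an update history per piece of information; the thresholds and field names are invented:

```python
from datetime import datetime, timezone, timedelta

def freshness_report(update_times: list[datetime],
                     stale_after: timedelta = timedelta(days=90)) -> dict:
    """Given the timestamps at which a piece of information was ingested or updated,
    report its age, how often it has changed, and whether it looks stale or static."""
    now = datetime.now(timezone.utc)
    ordered = sorted(update_times)
    age = now - ordered[-1]
    span = (ordered[-1] - ordered[0]) or timedelta(seconds=1)
    changes_per_year = (len(ordered) - 1) / (span.days / 365.25 or 1)
    return {
        "age_days": age.days,
        "updates_seen": len(ordered),
        "changes_per_year": round(changes_per_year, 2),
        "stale": age > stale_after,
        "static": len(ordered) == 1,   # only ever seen once: never updated since ingest
    }

# example: a claim ingested once and updated twice since
history = [datetime(2024, 3, 1, tzinfo=timezone.utc),
           datetime(2024, 9, 15, tzinfo=timezone.utc),
           datetime(2025, 1, 10, tzinfo=timezone.utc)]
print(freshness_report(history))
```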
But I think that's something we're just
going to have to accept: there's only
so much we can do with the past, and
we just have to really focus on what
we can do with this information going
forward. It's a big transition.
We've seen big technology transitions
like this in the past where
it's a paradigm shift, and part of
that paradigm shift is there's the
before time, and then there's the
after time. I used the word time
a lot, didn't I?
Nicolay Gerold: Hehehehe Yeah, you did.
Can you maybe first define the three
different terms you used, an observation,
a fact, and an assertion, and how you
would define those? And also, how do you
automatically differentiate between them?
Daniel Davis: Two of those, I think,
have pretty straightforward definitions.
One does not.
So, observations and assertions.
Pretty straightforward.
An observation is I think about it a bit
like in the quantum mechanical sense in
that you don't really know the state of
a system until you make that observation.
So an observation would be: you're
wearing a hat that says, I think
it says, Sunday Running Club.
Is that what it says?
Nicolay Gerold: Yeah.
Daniel Davis: Okay.
So that's that's an observation.
And an assertion would
be, I think it's green.
Which I think on my screen,
it looks kind of greenish.
So the distinction between those is
that an observation is a recording
of something that I can measure,
where an assertion is more of a
human opinion, a human statement,
my opinion about a particular
situation that doesn't have a
supporting observation.
So, that's the way I look at those two.
An observation has to be
supported by some other piece of
information, whereas an assertion
probably is not.
Now, fact.
Oh, boy.
This is something that even my co-founder
and I don't exactly agree on the
definition of, but we're working towards
one that we mutually agree on, because
it's something that can get so
complicated so quickly.
And I find it so concerning that
the majority of people today think
it's a really simple definition.
They think facts are these
black and white things.
And if you talk to anybody who's
worked with really complex information,
who's worked in the intelligence world,
for instance, they will tell you that
there's no such thing as a black and
white fact. Everything is in this
gray area. So I started giving this
a lot of thought: what is a fact?
We can start thinking about a fact
as an observation that is supported
with evidence or some sort of information.
Okay, that's a starting point.
But what happens when it changes?
And we see this all the time.
I think about the California
wildfires in LA.
There's a lot of information
coming out about that.
People are making a lot of observations
and they're not necessarily
trying to be malicious, but.
The information changes over
time as we get more information.
Some of the observations may not have
been completely accurate, or they
might not have been fully detailed.
There might have been omissions.
There might have been ambiguity.
There might have been misunderstandings.
So what do you consider a
fact in that situation?
I think the LA wildfires are a really
good example: what do we consider facts
in that very rapidly changing situation?
And that's where time comes in.
I started thinking about this and I went,
well, a fact is going to be something
that no matter how many times we talk
about it, it's never going to change.
It's static.
It's like geological bedrock.
You know, when you talk about
archaeologists or geologists and they
dig and they hit bedrock and they stop.
Why?
Because you've hit bedrock.
There's nowhere else to go.
You can't dig any farther.
And I think that's, I don't want
to say the only way, but that's a way
that we can look at a fact in a more
granular way that we can work with
going forward in time, again, we have
to use the word time a lot: it's
something that's never going to change.
Like, for instance, my observation
that you're wearing a hat
that says Sunday running club.
If I say during our podcast
episode, you're wearing a green
hat that says Sunday running club.
Well, that's never going to change.
So that's a fact.
Now, if I say your hat...
well, what can I say?
If I made some sort of assertion
about how I think you're feeling right
now, that's an assertion, and you may
tell me later on, well, I didn't
really feel that way.
So that's an assertion
that may change over time,
an observation that may change over time.
But the fact, well, there I go, see,
we have all these words; we struggle
to describe these situations without
using the word that we're trying to
define. So the observation that you're
wearing a green hat during our recording
that says Sunday Running Club,
that's never going to change.
So that's static.
So, in our thinking, we would
classify that as a fact.
Now, you can get much more
granular, much more detailed.
You start getting degrees of
fact, degrees of truth, which,
frankly, I'm a little bit
scared to go down that road.
I want to create a relatively simple
system, at least at first, one that
I think people can understand, because
I think people can understand
this piece of information.
It's never going to change.
It's etched in stone.
It always was true.
It always will be true.
And when you start thinking
about it that way, you realize
there aren't a lot of facts.
Most things can change.
And that's why the world of fact,
the world of information and
disinformation, is so difficult to
understand: because there is so little
information that is truly etched in stone.
Everything else is dynamic.
And when we start to accept that
information is dynamic and keep track
of it, keep track of how it's changed
over time, keep track of how often it
changed over time, is it changing
rapidly during this period and not
over here, that gives us a lot of
information to be able to determine
whether we trust this information.
And, you know, those are a lot of words
that have ambiguous, controversial,
and conflicting definitions.
And I think that with AI and AI systems,
how we're trying to use those, and
the availability of information now,
these are imperative problems to solve.
Nicolay Gerold: And then for the rest,
especially for the assertions: are
there other factors coming into
play, like consensus, the confidence,
or the authority of the source?
Do you handle that on top of it as
well, or do you want to handle it?
Daniel Davis: I'm going to give the
politician answer and waffle
and say, I don't know, maybe.
Well, okay, let me talk through this
for a second, because my mind is
racing, because we can go in so many
different directions with this.
But there's value actually on both sides.
From a system designer, from a
trustworthiness perspective, I think
the analytics within the system are
kind of where we're going to ultimately
land as being the source of truth.
And here we are.
We're using words again
that we're trying to define.
However, knowing the confidence of
your assertion could be particularly
useful in the system, because then all
of a sudden we can say, you made an
assertion about what was happening in
the Palisades during the wildfires,
and you said it with high confidence,
and now we've got information
that conflicts with that.
So we might say, well, wait a minute,
maybe this person or this news source,
that affects their trustworthiness:
they said they were really confident
about this information, and now we've
got conflicting information. And you
can start to establish trends:
how often does this happen?
Does this news source continually give
us assertions with high confidence,
and that information turns out to
have conflicts and changes?
So that is useful information for
the system to have, but I wouldn't...
okay,
that's a hard question.
These can go in lots of different ways.
See, you're smiling too.
So I'm going to make an assertion.
You're enjoying this.
You're enjoying seeing me squirm.
I think,
you remember how I talked about there's
a before time and there's an after time.
I think the answer to this is we
have to take a Bayesian approach.
That first observation, you have
to take at face value.
If a source, and again, if we're using
the news, for instance, if the news
makes an assertion with high confidence,
we have to take them at face value:
okay, this is a highly
confident assertion.
And then as the information gets into
the system and we get more information,
we get updates to that information,
we can begin to go back and revisit:
was that correct? Was it fair to make
that assertion with high confidence?
But at the same time, we actually
need more information on
that particular topic to even
be able to come back to that.
So
Nicolay Gerold: Yeah,
Daniel Davis: With data, the very first
step is not the same as all the steps
that come after it, because there is
that naive aspect: you're trying to
understand information, you're trying
to understand a document, and you're
getting it into the system, so I don't
know a lot about it. But once it's in
the system, I do know a lot about it.
And, of course, Bayesian methods are
kind of the mathematical way we
deal with these problems.
And that's something that we've also
talked about for a long time, being
able to use more Bayesian models
to try to understand these kinds of topics.
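One simple way to realize that Bayesian idea is a Beta-Bernoulli model of a source's reliability: start from a prior that takes the first confident assertion at face value, then update as later information confirms or conflicts with it. This is a hedged sketch, not TrustGraph's method; the prior values are arbitrary:

```python
from dataclasses import dataclass

@dataclass
class SourceTrust:
    """Beta-Bernoulli model of a source: a mildly optimistic prior (take the first
    high-confidence assertion at face value), updated as later information confirms
    or conflicts with what the source claimed."""
    alpha: float = 2.0   # pseudo-count of assertions that held up
    beta: float = 1.0    # pseudo-count of assertions that were contradicted

    def update(self, held_up: bool) -> None:
        if held_up:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def expected_reliability(self) -> float:
        # posterior mean of the Beta distribution
        return self.alpha / (self.alpha + self.beta)

news_source = SourceTrust()
print(round(news_source.expected_reliability, 2))   # prior: ~0.67

# later evidence conflicts with two of its confident assertions, confirms one
for outcome in (False, False, True):
    news_source.update(outcome)
print(round(news_source.expected_reliability, 2))   # revised downward to 0.5
```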
Nicolay Gerold: Maybe to nail that
down a little bit: this isn't just a news
thing. In a prior episode, I talked to
Max about the documentation at Google,
and there are often outdated components,
like where one part of the system was
updated and its documentation with it,
but for a dependent system, the
documentation wasn't updated.
So you really have this temporal aspect:
one thing changed and the other thing
didn't. With a temporal component,
and a component that actually maps that
dependency, you could make it obvious
that it doesn't fit anymore, that
something is outdated. But in the current
scheme of things, where you basically
embed once and use it forever, this
just doesn't work. And the same goes for
what we recently worked on, where they
basically have documentation
on all of the different parts and all
of the different systems, and often
different components are swapped
out, so the machine doesn't really
fit with the documentation anymore.
So this is also something
you have to consider.
And the human part is mostly
the biggest point of failure.
Because when the documentation
isn't updated, you can't really
update it in your system.
Daniel Davis: Oh, that resonates
on so many different levels, including
our own TrustGraph documentation.
I think I told you that we have
lots of good intentions about
keeping it updated, but it's
pretty much exactly what you said.
You have these dependencies that
you've maybe forgotten about, or
it's not immediately obvious, and
you don't realize that's there.
Knowledge graphs are a great solution
for that problem, in that you see
those relationships. So once you start
to track things temporally, you can
look at: when was this last updated?
And, oh, these two topics are
related, and this one hasn't been
updated for a very, very long time.
Maybe it should be updated.
But so many things come to mind when
you talk about how information evolves
over time and understanding that.
And I don't really have a good answer
for why we don't do this more now,
because in the world I came from in a
past life, the aerospace world, time was
really, really critical, and there was
a lot of effort that went into keeping
track of this kind of information.
Now, it wasn't done with
a technology solution.
It was mostly done by people,
recording it sometimes in spreadsheets.
I wish that wasn't the answer, but it
really, really was, because there was a
recognition that these issues are really,
really important: understanding that
this system failed, and it's related
to all these other systems, and we
haven't seen anything from this
system in a long time.
Maybe we should, you know, check it out.
Maybe there's something
going wrong with it.
Or, somebody mentioned to us recently
an interesting use case for what
we're doing: aircraft maintenance.
I know a lot about that one, and
that's actually a really big one.
You treat an aircraft almost like a person
with their health record: you look
at its maintenance record, and time is
a huge aspect of it, because you have,
well, here are the things that you do first.
You want to see that checklist,
and it can be really important:
what's the order of things?
What's the order you did this in?
Did you check the thing that
we told you to do first, first?
Because if you didn't, we probably
need to go back and check that.
Or knowing the relationships:
depending on what system you checked,
maybe we don't need to do that,
because there's a sequencing order here.
You know, the sequence of events,
the freshness of information,
the staleness, because that's
another aspect of it.
You know, I remember there
was one particular aircraft.
I still remember this aircraft really,
really well, because I had to travel
to go see it personally,
because it was sick.
You know, we had, you know, months
and months and months of records,
and at some point, it's useful to
know what they did eight months ago.
But that was eight months ago.
How many times have they flown
this airplane since then?
It's not that it isn't useful,
but you have to prioritize the
most current information more
than the past information.
But you also have to correlate it.
Is there information in the
old information that's not
in the current information?
And again, that's a really important
distinction. You need that time aspect
when you're ingesting all this information
to be able to make those distinctions.
And as you said, with something like
vector embeddings and semantic similarity,
this in situ context is totally lost.
And even with knowledge graph
relationships, this is not something
that's going to be naturally
in a knowledge graph either.
It has to be manually added, and
there has to be a schema or an
approach for how you're going
to do that: whether you're going to
pin the metadata to nodes,
properties, or relationships.
And again, if you look at
all of the schemas out there,
the ontologies and taxonomies,
the big ones like schema.org
or OWL or DBpedia,
once again, the focus is
on entity resolution.
You don't really see
a lot about time.
And if you actually look at the big
knowledge graph database systems,
they don't give you a lot of tools
for dealing with time either.
That's something that you
kind of have to add manually.
Nicolay Gerold: And how do
you see integrating the time
aspect into the knowledge graph?
Do you see it still as metadata that's
attached to nodes and relationships?
Or do you see it integrated
in a different way?
Daniel Davis: I think the
bulk of it is metadata.
The difference would be, say you have
the document that I've been working
on for this article. It's an old
aerospace airworthiness document
called MIL-HDBK-516C.
If you're looking for something to
put you to sleep at night, to cure your
insomnia, it won't take many pages of it.
And it was my life for one of my jobs
at one time, really understanding
the full breadth of it.
And, of course, it's going to have a date
at the very beginning: when it was
published, when it went into effect.
So I think that would be a
relationship in the graph: this
document called MIL-HDBK-516
was published on, and you would have
a literal that would be the date.
So that wouldn't be metadata;
that would be an actual
relationship in the graph.
But when we ingested and processed the
document, that would be metadata.
And then, when new data or new updates
came in that affected topics, nodes,
or relationships in the graph,
that would be new metadata.
So I think the bulk of it is metadata,
with some exceptions. For instance, the
date that you publish this podcast:
that would be a relationship,
podcast was published on, and then
you would have that date.
Or news: this news was published
on X date. And you would actually keep
that going forward; even if there are new
updates, you would probably maintain
that, because that would be useful.
But all the processing you're
doing in the system,
that would be metadata.
And metadata is really where
the intelligence process lives.
In processing raw data, it's all that
metadata you append to it, and
understanding the relationships between
the metadata and all the insights
and analytics and all the connections
that come from that data set.
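As an illustration of that split, a published-on date can live in the graph itself as a triple with a date literal, while ingestion and update history sit in metadata records on the side. The predicate and pipeline names below are assumptions for the example, not TrustGraph's actual schema:

```python
from datetime import date, datetime, timezone

# A "published on" date lives in the graph itself, as a triple with a literal object.
# (Predicate names and the date value here are illustrative.)
triples = [
    ("doc:mil-hdbk-516c", "rdf:type", "ex:TechnicalDocument"),
    ("doc:mil-hdbk-516c", "ex:publishedOn", date(2014, 12, 12)),
]

# Processing history, by contrast, is metadata *about* the graph data,
# recorded alongside it rather than as domain relationships.
processing_metadata = [
    {
        "subject": "doc:mil-hdbk-516c",
        "event": "ingested",
        "at": datetime(2025, 1, 5, 14, 30, tzinfo=timezone.utc),
        "by": "ingest-pipeline",   # illustrative component name
    },
    {
        "subject": "doc:mil-hdbk-516c",
        "event": "updated",
        "at": datetime(2025, 2, 1, 9, 0, tzinfo=timezone.utc),
        "reason": "new revision affected related nodes",
    },
]
```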
Nicolay Gerold: I'm really torn how to
differentiate between these two, like,
when is it actually a relationship?
And when is it an attribute?
How do you put your finger down and say:
okay, in this case it's a relationship;
in this case it's just an
attribute, which I put on the node?
Daniel Davis: I think the simple
answer is we're just going to have to
make a definition and see how it works
out. I think this is going to be a bit
of an empirical process: this is our
approach, we think this is the
best way to approach it.
And again, Bayesian: we make our
assertion. I mean, that's
the Bayesian process.
You make your Bayesian prior,
this is the system,
and you go forward and you test:
was this a good assertion?
Was this a good definition?
And I'm fairly confident
it will change over time.
And, philosophically, that's why this
entire area is so complex: you just
quickly pointed out a simple area of
the system that we can't really define
in a black and white way that
everybody's going to agree on.
Already you figured that out.
And, you know, that's part of
my core thesis is that none of
this stuff is black and white.
Facts are not black and white
until you get something that
just those rare statements, those
rare pieces of information that
always were and always will be.
And those are pretty rare.
Everything else is somewhere in between.
And when you have things in between
and you get 100 people, you're going
to get a lot of different approaches.
And then you have varying
degrees of value out of those.
Nobody's going to be... why do I
say nobody? See, again, those
are absolutes.
People are going to be varying
degrees of right or wrong.
Again, binary choices.
There's going to be degrees of value
that come from these approaches.
And again, from a business perspective,
from a system design perspective,
the goal is to drive the most amount
of value at the least amount of cost
to the user. So that's going to be
where the trade-off is.
And I think that's why we kind of have
to simplify a lot of this, because
it can get so complicated so fast.
I had somebody ask me the other day
whether I worry about duplicates
in our knowledge graphs.
And I went, there are a lot of people
who get really, really wrapped
around the axle on duplication.
But does it really impact the outputs?
Oftentimes it doesn't.
I mean, if you look at any data store,
especially for any large company or
customer data set, it's going to be
mountains and mountains and mountains
of duplication, and you can work with it.
So how much time do you want to spend
on deduplication when your query tools
and your ability to access that data
allow you to get the value
that you need out of it?
Those are part of the system design
choices: yeah, this isn't perfect,
yeah, there are duplicates,
but does it affect the outputs
we're getting from the system?
And how much time are we going to have
to spend to do this deduplication?
I think that's a very similar topic
to what you're talking about here.
You can very, very easily go down this
rabbit hole of how to prevent duplication
and spend huge amounts of time
and effort trying to do it.
And then you go, what did we get out of it?
Nicolay Gerold: Yeah, and I think in
the end it will depend on what you put
as nodes, what the relationships are,
what entities you are using, how you
are integrating your temporal data.
It's: what queries do you actually
need to support? What are, in the end,
the query patterns of the different
users? And this will inform, in the
end, what relationships, what entities,
and what metadata I need.
Daniel Davis: You know, querying is
an interesting topic in and of itself,
too, because, just take knowledge graphs,
for instance: you can do really
sophisticated query techniques
with knowledge graphs.
You can start querying on: I want to
know the shortest path between two nodes.
I want to know the longest
path between two nodes.
I want to know this kind of path.
I want to be able to do this many hops.
There are some really sophisticated
graph query techniques out there, and
I don't think many people actually
use them. And I say that, that's an
assertion with relatively high confidence
for me, because I know so many
people whose life is selling graph
databases, and you talk to them:
what are your users saying?
And we find very few people use those
really sophisticated techniques.
It's more of: I have a topic, and I want
to get the first hop of subgraph
that's related to that topic.
Now, what's interesting is, not only do
people not really use those sophisticated
techniques much, and I'm sure I'm
already envisioning lots of comments
from people telling me I'm wrong on
this, but once again we're about to
revisit the theme of the before times
and the after times.
I see use cases for those
really sophisticated techniques
with complex ontologies.
If you have this really complex ontology
and you know, the data is in this
structure, then there are use cases where
you might want to say, is this piece
of information, is it connected in any
way to this other piece of information?
And you might want to use some really
sophisticated techniques for that.
However, with AI, with the language models
that we're now using, we extract and build
our knowledge graphs to be very, very
simple, with a very, very flat structure.
And the reason why, and I guess this is
also something that I should mention, is
that we say graph RAG, but we don't
do pure graph RAG. If the graph RAG
definition is that you're using an LLM
to generate Cypher or GQL or SPARQL
queries, that's not what we do.
We actually create mapped vector
embeddings, and then we use a semantic
similarity search on the query or the
request to say, hey, what are the
potential nodes of interest here,
rank those, and then generate subgraphs
off of those with pretty standard queries;
there's something like an eight-query
permutation that gets all the
relationships to build a subgraph.
So that's the way that we approach it.
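Roughly that retrieval flow, sketched end to end: embed the node labels, rank them against the query by cosine similarity, then expand the top candidates into a subgraph with plain triple lookups. The embedding function here is only a placeholder and all names are invented; the real TrustGraph pipeline will differ in detail:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for whatever embedding model you actually use; returns a unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

# node label -> embedding ("mapped vector embeddings")
node_embeddings = {label: embed(label) for label in
                   ["airworthiness", "flight control", "Sunday Running Club"]}

triples = [
    ("airworthiness", "definedBy", "MIL-HDBK-516C"),
    ("flight control", "assessedFor", "airworthiness"),
]

def retrieve_subgraph(query: str, top_k: int = 2, hops: int = 1):
    q = embed(query)
    # rank candidate nodes by cosine similarity to the query
    ranked = sorted(node_embeddings,
                    key=lambda n: float(q @ node_embeddings[n]),
                    reverse=True)[:top_k]
    # expand each candidate into a small subgraph with standard triple lookups
    frontier, subgraph = set(ranked), []
    for _ in range(hops):
        nxt = set()
        for s, p, o in triples:
            if s in frontier or o in frontier:
                subgraph.append((s, p, o))
                nxt.update((s, o))
        frontier = nxt
    return ranked, subgraph

print(retrieve_subgraph("how is airworthiness assessed?"))
```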
And if you approach it that way, all of
a sudden, do you really need these
complex ontologies? If you look at
schema.org and all the depth they go to
to try to classify websites for SEO,
do you need all these product
categories and all these
categories anymore?
Because we now have a tool, semantic
similarity search, to know this is
the topic that we're looking for.
And if we can then correlate that
topic to all the other related
information that's in the graph,
do we need these complicated ontologies
and taxonomies or schemas anymore?
And I would argue no.
You can still ingest them; as a matter
of fact, one of our users is actually
working with STIX, which is a very
complex ontology in the cybersecurity
world for threat intelligence.
But I think what we're seeing is that
we now have tools that don't require
these really complex ontologies and
schemas, and that let us work with data
more at the relationship level.
And because of that, I think a lot
of these really, really complex query
algorithms are going to see even less use.
And I could be wrong; somebody could
find some really, really niche use case
for some of those that nobody's
ever thought of.
But I don't see it at the moment.
Which also brings me to something
that I also kind of think about,
and that's: do you really need a
knowledge graph database?
That's a really, really interesting
question, because you can store a
knowledge graph in columnar tables.
There are ways to do that.
It's not that hard.
As a matter of fact, that's what we
do with our Cassandra deployment.
Cassandra is not a knowledge graph
database, so we're actually writing
knowledge graph triples into
sets of tables in Cassandra.
We also support Memgraph and FalkorDB
and Neo4j and some others that we're
working on integrating, which are more
traditional knowledge graph databases,
but you don't even have to have a
knowledge graph database system to
store these kinds of relationships
and then extract them.
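A sketch of what triples in plain columnar tables can look like, using the DataStax Cassandra Python driver; the keyspace and table layout are one reasonable guess, not TrustGraph's actual schema:

```python
# Storing graph triples in plain Cassandra tables - just one possible layout.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS kg
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
# Partition by subject so "give me everything about node X" is a single-partition read.
session.execute("""
    CREATE TABLE IF NOT EXISTS kg.triples_by_subject (
        subject   text,
        predicate text,
        object    text,
        PRIMARY KEY (subject, predicate, object)
    )
""")

insert = session.prepare(
    "INSERT INTO kg.triples_by_subject (subject, predicate, object) VALUES (?, ?, ?)")
session.execute(insert, ("doc:mil-hdbk-516c", "ex:publishedOn", "2014-12-12"))
session.execute(insert, ("flight control", "ex:assessedFor", "airworthiness"))

# One hop out from a node is one partition scan - no graph database required.
for row in session.execute(
        "SELECT predicate, object FROM kg.triples_by_subject WHERE subject = %s",
        ("doc:mil-hdbk-516c",)):
    print(row.predicate, row.object)

cluster.shutdown()
```

In practice you would likely add a second table partitioned by object so you can traverse edges in both directions, but the idea is the same.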
And again, it kind of goes back to:
maximize value from the system
while minimizing the cost that it
takes to design it and operate it.
And yeah, it's fun to think about
complex ontologies, but is
it really necessary?
And that's something that my co-founder
Mark actually talks about.
I've asked him this question before:
why do people even develop these incredibly
complex ontologies and these schemas
when there are ways that you can
query the system without them?
And I think it's a little bit
that people just kind of like it.
It's something that people
found interesting. You give people
a tool and they go, oh, I can do
all these other things with it.
And nobody ever stopped them
and said, well, do you need
to do those things?
Do we actually need that tool?
Or is the hammer fine?
And I think in this case, we're finding
that pretty simple queries suit most
people's needs, and we can now do
them in a very sophisticated way.
Nicolay Gerold: I'm a little bit torn,
because I think with LLMs we get into
a space where it's become way easier to
actually construct knowledge graphs.
But when we use LLMs, it's typically very
free, so any entities, any relationships
are extracted and inserted into the graph.
And because they have been trained on the
different schemas as well, I think LLMs
tend to extract entities and
relationships that are similar to
schema.org and the other big
taxonomies and ontologies.
Versus, I think, more of a simple
homegrown solution, where you focus
on a few core entities and a few core
relationships which actually matter for
your system, and you restrict it to those.
I would actually like to know which of
the two sides you are leaning towards:
a really open knowledge graph, which is
ever expanding, or more of a
constrained one, which is really tailor
made for the system? We are now going
into the black and white thing again.
Daniel Davis: Well, to address the
first point you made about LLMs, one
of the things that we've found
consistently is that different LLMs
approach this problem very differently.
You're very right in that regard.
For instance, the Llama family of models
seems to focus more on people: even if
you ask it for entities, you're
probably going to get people, and it's
kind of hard to get it off of that.
Actually, there are a lot of models where,
if you ask for entities and conceptual
topics, the lists will be exactly the same.
It's actually a pretty small
number of models that we've found
that can make that distinction.
So I completely agree in that regard:
there's no perfect extraction
with an LLM right now.
There's no question about that.
And it can also be a little bit specific
on the knowledge, because you're kind
of relying on: is this information in
the training set for this model?
And here's a good for instance on that.
Most LLMs that we've dealt with,
well, they do pretty well on Cypher
for sure, because there's a lot
of stuff about Cypher out there
on querying knowledge graphs.
So they do pretty well on Cypher.
RDF, on the other hand, and you
mentioned schema.org, which is
actually based on RDF:
RDF, not so much.
The most common format that people
work with for RDF is Turtle,
which is the most human readable.
And we have not found an LLM that can
really consistently output Turtle.
And the reason why is, if you just
do a Google search on RDF Turtle,
you'll find blog posts where the
explanations are just flat out wrong.
So then you're thinking: well, think
about the training process.
Garbage in, garbage out.
Garbage went into the system,
so how do you fix that?
Well, in all these processes on the
back end where humans are manually
reviewing the data, whatever buzzword
people want to throw at those,
somebody would actually have to be able
to look at that and know enough about
RDF Turtle to realize that this is wrong.
And I'm going to wager a decent amount
of money that there is nobody on the post-
training side of companies like OpenAI and
Anthropic and Google who knows enough about
RDF Turtle to realize this isn't right.
So right there, we can see these
models are not going to be able to
do this consistently, and they're not.
And to answer your question,
our approach is that we're trying to
take as broad an approach as possible.
We don't want to create
these very, very narrow,
complex ontologies and schemas.
We've tried to develop TrustGraph from day
one to be as flexible, as modular, as
open as possible, so people can take
these components and very
quickly tailor them to their needs.
And one of the things that we focus on is
what we call cognitive cores,
which are basically modular components
where we've done this extraction.
For instance, MIL-HDBK-516C, which
I've been playing around with:
if you go to our GitHub, we have
a cognitive core that you can download
and reload into the system in minutes,
where we've already done the extraction,
so you don't need to do that extraction.
Part of our vision is that
you would be able to very quickly
share a cognitive core you've made
with me, and I could share one
that I've made with you.
There can be a store where
we can download these.
You could upload ones that you've made.
You know, you could take your entire
podcast series, and you could create
cores from the transcripts and upload
those and people could use them.
And again, though, it's not perfect.
It's not black and white.
Even the cores that we've made for this
particular document that I keep talking
about, there are some relationships
that aren't necessarily connected.
If you're able to really dig
into how it's stored in the graph, the
information is there, but sometimes
there are some missing links.
And that's why our future roadmap
includes working on tools that would
give people the ability to
make edits going forward.
So again, temporally, not only would
there be the automated side of that,
where you're ingesting new information
and, if the information supports the old
information or changes it somewhat or
conflicts with it, making those
observations and recording all that
and doing all those calculations,
but also where the human can say:
no, I manually added this, or I did
this because this is my take on it.
And again, underlying all of that, there
would be metadata associated with it:
is this an automated process?
Was this a human process?
Who made this edit?
Who made this change?
Recording all that information.
And as you go over time, that enables you
to, I hate to use the word calculate, but
to come to a better understanding
of what sources, what knowledge
relationships you can trust in the system.
And again, we could have spent days
talking about how much of our lives
my co-founder and I have spent
trying to create metrics to calculate trust.
And it's not just us; many people have.
And I haven't seen one that has
really caught on, because they
all have lots of pros and cons,
and the cons often hurt the pros so much
that it's difficult to use them.
Which, again, gets us back to:
what's the most amount of value
we can get from these systems at
the least amount of cost?
And I kind of think about it from the
Moby Dick perspective, the white
whale. Some of these are white whales.
People spend their entire
careers chasing after these topics:
what's trust as a number, what's risk
as a number. They dedicate decades
to it and they do good work, but
they never quite get there. And I think
that's really important to realize:
how many of these topics are white whales?
And I think there are a lot of
white whales in AI right now,
topics people are chasing where
you're never going to get there.
We're much better off accepting that
now and saying, we're never
going to get to this point of perfection.
So what are approaches that are usable,
or how do we begin to build on top of
these processes and approaches to get
the necessary amount of value out of
it, commensurate with what
we're putting into it?
Nicolay Gerold: I think that's
actually also one of my biggest pet
peeves in the space: tools just keep
adding on features until they
are basically useless.
I think we are lacking stuff in this
space, open source projects but also
software products, where they just say:
this is the feature set, we do one thing
really well, we now have a feature set,
now it's only maintenance and
we don't go beyond that.
And I think there are a lot of tools that
have found it, they are really great
solutions, and now they just start adding
useless stuff for additional use cases.
And I think we need more tools that just
say: okay, this is what we are doing,
we are really great at that,
and we are not going beyond that.
Daniel Davis: Well, that's kind of why
we haven't worked on lots and lots
of connectors for TrustGraph.
If you look at some of the other
frameworks out there, agent frameworks,
different frameworks, they have lots of
connectors for all the back-end systems
that your enterprise might be using.
They have a page that you can scroll
and scroll and scroll, and that could
be 50 different connectors, even more.
And that's one of those situations
where, once you go down that road,
it's exactly what you said:
you will spend so much time maintaining
all those connectors, because every time
those dependencies update their
system, or update their API, or
make a change or add a new feature,
that's where all your time goes,
maintaining those connections.
And so, to your point: are you really
focusing on the thing that you do well?
That's where we've been trying to focus,
on the things that we do well,
and not on all those other connectors.
But I think there's also sometimes
a little bit of unrealistic
expectations from customers that
the product should do everything,
and they don't realize that for it
to do everything, it's probably not
going to do anything particularly well.
So it's one of those things.
I think it's interesting: when I was
growing up, people had Swiss Army knives.
Do you remember those?
Because I don't ever see people
talk about those anymore.
It was a big deal to
have a Swiss Army knife.
Nicolay Gerold: have them.
Daniel Davis: But do you ever use it?
Nicolay Gerold: Yeah, because
I often lose stuff like,
Daniel Davis: Okay, okay,
Nicolay Gerold: bottle
openers and stuff like that.
Yeah.
Daniel Davis: From that perspective,
oh, wow, it's a dream, because,
look, it does everything.
It does this, it does that.
You can make amazing content about
all the things it does.
But if you try to use it, you
realize that, yeah, you could
use it to open a bottle of wine,
but if you had an actual wine
opener, you're going to choose
that every single time,
because it's the better tool.
It's better because it's designed
specifically for that.
So on one side you get these tools
that do it all, but do they really
do it well? And wouldn't you be better
off just taking the one that does
the one thing really, really well?
Nicolay Gerold: Yeah, and I think that
really leads into another topic we
talked about before, API meshes.
You said that a lot of AI is
actually just connecting different
APIs and calling them.
How does that fit together with the
view that tools should be specialized?
Because when I use specialized tools,
I tend to have more different modules,
APIs, whatever, that I have to call.
Daniel Davis: The thing is that so
many of the frameworks, or even platforms,
that we see out there, when you really
dive into the code, that's all they
are: glue code. Because what is glue
code? Glue code is connecting
two different services together.
And when everything is an API service
and you just begin to connect them
together, yes, you've kind of built this
really complicated mesh network, in a way.
But then how does that network
operate on its own?
You've created this mesh network
with all these API connectors, or even
more linear systems, more point to point.
So what's controlling all that
network flow underneath it?
And if you're not worried about
performance, if you're not worried
about having lots and lots of users
and lots of data, maybe you don't care.
And I think that's one of the things
I've seen in AI: so much of it,
what you see is a demo, where so much of
it came out of a hackathon, something
that people could build over a weekend,
something that came out of a Colab
notebook that people could really
quickly get their hands on and,
within an hour, say, oh, this does X,
and it's really cool.
What does it take to take it beyond X?
Can it go beyond X?
How do you take this to having
millions of concurrent users and
petabytes of data underneath it?
And then what is the user experience?
When you create these mesh networks
with no control underneath them, you
create these bottlenecks where
somebody's going to be waiting.
I think about a state machine analogy
with an elevator, in which it has to
transition from: I'm on the 4th floor
and I'm going to the 12th floor.
If somebody else is trying to
go down, it's not going to stop.
It's going to take me to the 12th floor
and then come back, because those are
just the rules of how the states
transition. And that's something
that we just accept for an elevator;
that's just the way it works.
But customers of software products
don't tend to have that patience when
they're waiting because, oh, somebody
else's requests are going through and
you just have to wait, because that's
the way the system has been designed.
And to get to your question, I think
there needs to be more of an
understanding of what software
infrastructure really is
and why it's important.
And there have been a couple of things
that I've found shocking over the last
year. I used to go to a lot of
AI events here in SF;
I haven't had as much
time to do that lately.
And when you say words like
infrastructure, so many developers now
look at you like, what do you mean?
Do you mean like AWS?
What do you mean, infrastructure?
What does that even mean?
TrustGraph uses Apache Pulsar as our
backbone, and I'll say things like,
we're using a pub/sub backbone.
And this is no joke, I'm not kidding
on this, because I actually count:
in the last year, I've met fewer than
five people who knew what pub/sub was.
And these are Silicon Valley
engineers who are building AI.
That is no joke.
That is no exaggeration.
Because I count.
And some of the people that I've met
who actually knew what it was were
actually visiting San Francisco
from somewhere else.
I'm just flabbergasted.
I'm dumbfounded.
And I think that's because, and I'm
trying not to derail us here, because
I'm about to spin up on a topic I
could just keep zooming in on,
over the last decade or so
we've decomposed software down
to the smallest components possible,
to get it into a JIRA task.
And that's so many software engineers'
lives now: they work on a very
particular set of features, maybe
one feature in this one service in
this one overall stack, and they
don't really touch anything else.
And so it's not that they don't
understand these topics, it's not that
they're not technical enough; it's just
that they don't ever see any of this
stuff. They don't ever see it.
They don't ever think about it.
And if you spend too long in these
worlds, in these very, very granular
topics, you kind of lose the ability
to see the big picture, because
you just haven't flexed that muscle.
You lose that ability to think,
oh, wait a minute: how does this
affect the site reliability team?
What do those people even do?
What's the stack underneath this?
How does my service
relate to other services?
What are they actually running on?
How do we deploy these services?
How do we have backups for these services?
How do we have failover?
How do we do any of these things?
And there are so, so few people
who think about those problems.
Oftentimes companies don't invest
in it because it's not a feature;
it's not something that they can
directly market and sell to the customer.
So you only see it in people who have
worked in environments that have
high reliability requirements:
high reliability, high availability,
you can't lose data, data loss is
really, really critical.
So unless you happen to work on
software that touches those systems,
you're going to lose it.
You've just kind of forgotten about these
issues, or haven't even thought about them.
And I think we've got to start thinking
about these problems at a more holistic
level, a bigger picture, the systems
level of how these systems integrate
with each other. Because if you build a
really, really reliable, flexible,
modular system where you can
plug tools in, then you can build the
best tool possible and it plugs right in.
And the thing is, this
stuff already exists.
It's out there.
That's why we chose Apache Pulsar.
Apache Pulsar is fantastic.
It's got all of the nines in reliability.
It's got all the
modularity and flexibility.
The way we use it in TrustGraph,
you can very easily take a software
module and hook it into Pulsar, and all
you've got to do is define your queues.
If you're trying to subscribe
to a queue that's already there,
you just subscribe to that queue.
If you want to create a new one, you just
create a queue and start publishing to it.
It's crazy how simple it is to
scale up on these services so that you
can focus on building this one tool
that's great at what it does, but you
have to have that foundation there.
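A minimal example of that publish/subscribe pattern with the Pulsar Python client; the topic and subscription names are made up, and TrustGraph's own queue naming will differ:

```python
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# A module that produces work just creates a topic and starts publishing to it.
producer = client.create_producer("persistent://public/default/extracted-triples")
producer.send(b'{"s": "flight control", "p": "assessedFor", "o": "airworthiness"}')

# A module that consumes work just subscribes to the topic that is already there.
consumer = client.subscribe(
    "persistent://public/default/extracted-triples",
    subscription_name="graph-writer",         # illustrative name
    consumer_type=pulsar.ConsumerType.Shared,  # multiple workers can share the load
)
msg = consumer.receive(timeout_millis=5000)
print(msg.data())
consumer.acknowledge(msg)

client.close()
```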
And that's what I see totally lacking
in so many of these frameworks that just
focus on the higher levels of the stack.
If you're thinking about a building,
a high rise, they focused on building
out all the floors first, all the
office space and all the bathrooms
and all the kitchens.
And they're great, except:
what's underneath it?
There's nothing there.
Well, what happens when we
try to scale it up?
It collapses. And that's kind of
one of those big picture topics,
and honestly that's why I've been
doing more stuff like this, why we
started doing our own podcast series
where we interview people, and why
we've been more vocal: to try to shift
the narrative in this space and say,
time out. Yeah, the stuff you guys
are doing is really, really cool.
But you're going to hit a brick wall.
Actually, you're not
going to hit a brick wall.
You're going to hit an impenetrable
wall that you can't tunnel through.
You can't tunnel underneath
and you can't climb it.
You're just going to be stuck.
And the only way to get around that
wall is going to be to turn around
and go in a different direction.
Nicolay Gerold: Yeah, I think it's
partly also that there aren't many people
who actually get to see the problem
and come up with the requirements for
a solution, what do I actually optimize
for. The large majority of developers
actually sit at the end of the process.
They get told what they have to achieve,
or rather the solution they have to
program, and aren't really part of the
decision making, so they really
can't learn it.
And one project I think is really
interesting to study in that domain
is, for example, TigerBeetle, which
is an open source database tailor
made for financial transactions.
And I think it's such an interesting
project because you can really see,
when you go through the docs and their
blog posts, how they're optimizing for
that specific problem, and they really
go from problem to architecture to
the specific implementation, which is
really interesting to read through.
Daniel Davis: No, I completely agree,
and this is what happens when you design
companies where you want to decompose
people, essentially, into replaceable
cogs in the machine, and that's kind of
what we've ended up with: to big tech
companies, software engineers aren't
people; they're five years' experience
with Python and this particular
stack for this particular use case.
And we've decomposed it such that
if this employee leaves, we can just
slide somebody straight in and there's
no change; we don't see any change in
our productivity. But do that long enough
and you get people with 10 years' experience,
15 years' experience who, as you say,
haven't touched other parts of
the system. And it's not their fault.
It's not because they're not interested
or not capable; it's just that, in a
lot of cases, they're not even allowed,
or they don't even know that stuff exists.
I've seen that time and time again
in tech companies, where there are teams
doing work that you didn't even know about.
I had times in my aerospace career where I
found out that there was a row of people
two cubicle rows over doing the same thing
I was, and we didn't know about each other.
Or there's a team across the hall doing
the same thing for a completely different
set of people, and you just found
out about it by accident.
And you go, shouldn't
we be working together?
Yeah, you probably should.
But we're not going to. And you go,
isn't that going to cause problems?
Aren't we going to step on
each other's toes?
Something we do might break
something they're doing.
Yeah, and that happens a lot when people
are doing all this work in isolation.
But, I mean, all of a sudden we're
now into a rant against company
capitalism. When you prioritize
decomposing these tasks to that level,
that's part of the problem that
you're going to see, and it's
also short-term thinking.
Now, I've seen that change a lot through my career. I'm a bit older than I look, maybe to my benefit or detriment, I don't know. And I think about how engineering was approached when I first started working, and it's radically different now.
I remember I started as a junior engineer, and you had a senior engineer and a program manager, and part of the senior engineer's job was to be the tech lead, but also to mentor the junior engineers. That was part of their job; it was understood to be part of the job. That's why you have senior engineers and junior engineers on the team: so that the junior engineers get experience and learn. That is something I haven't seen for well over a decade now, this structure where we recognize that people need to learn and grow and be exposed to new topics.
Instead, we've chosen to value people that are the absolute expert in their very, very narrow niche, and then we expect them to be able to do all these other tasks outside of that, because, well, they're the best at this. Yeah, they're the best at that very, very narrow niche. That doesn't mean they're going to be able to figure out all the stuff that their counterpart, who's been doing it and only it for 10 years, knows.
So how can we possibly expect that? If I were an expert on a single graph query algorithm in Cypher, and all of a sudden you asked me to design a SQL database with X amount of reliability, you'd go, well, he's an expert in knowledge database systems. Well, no, I spent 10 years on this very, very narrow algorithm for this one system. I'm probably not going to be successful in applying that to something else when it takes somebody else doing it for 10 years to know it.
And that's a big challenge I see in the tech world right now, which even corresponds to what we're doing at an information level. It's the same thing. You're taking information in small chunks, whether it's social media or now vector embeddings, and you're removing it from its surroundings. You're losing all that context: the context of how it was initially connected within these systems, the relationships, the underlying relationships, how those relationships are all connected. And you need all of those connections to see the big picture.
Archaeologists talk about this a lot, and that's a term I use a lot: in situ context. Archaeologists talk about it when they do archaeological digs. That's why they're so careful, why they have these little brushes, actual paint brushes, to brush away the dirt: they're so terrified of removing an object from its context, because that's all the information you have. Once you find an artifact, take it out of the ground, and walk away with it, you've lost that context. Now you can't say, well, it was found in this layer with these other artifacts, so they probably come from the same time period or the same event, or how they are related, because now you're holding it in your hand and you've walked over here. Thinking about time again, you can't unwind that. You've removed its in situ context, and that's really applicable here in the AI space.
If you're just doing vector RAG, that's the same thing. You're taking these words or phrases and removing them from their in situ context, and once you get them out of that, they take on different meanings. And if you don't have a way of reconnecting those relationships from the original context, that's how you get, I don't want to say misinformation, but you can get inaccurate statements, statements that don't quite make sense, missing details. And all of that is really difficult to spot unless you read every single statement at the word level with a lot of understanding of what you're looking at. That's one of the articles I'm working on now: comparing these results and saying that, on the surface, they all look good to somebody that's not trained on the topic, but there are different levels of quality, for lack of a better term, in the responses from the different techniques.
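To make the in situ context point concrete, here is a minimal sketch, my addition rather than Daniel's or TrustGraph's implementation, of keeping provenance and neighbor links with each chunk so relationships can be reconnected at retrieval time; all names are illustrative.

```python
# A minimal sketch (my addition, not TrustGraph's implementation) of keeping
# "in situ context" with each chunk: where it came from, when it was ingested,
# and which chunks sit next to it, so relationships can be reconnected at
# retrieval time instead of being lost at embedding time.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Chunk:
    chunk_id: str
    text: str
    source_doc: str                      # document the chunk was cut from
    position: int                        # order within the source document
    ingested_at: datetime                # when it entered the system
    neighbors: list[str] = field(default_factory=list)  # adjacent chunk ids


def chunk_document(doc_id: str, text: str, size: int = 500) -> list[Chunk]:
    """Split a document but keep provenance and neighbor links per chunk."""
    now = datetime.now(timezone.utc)
    pieces = [text[i:i + size] for i in range(0, len(text), size)]
    chunks = [
        Chunk(f"{doc_id}:{i}", piece, doc_id, i, now)
        for i, piece in enumerate(pieces)
    ]
    for i, chunk in enumerate(chunks):
        if i > 0:
            chunk.neighbors.append(chunks[i - 1].chunk_id)
        if i < len(chunks) - 1:
            chunk.neighbors.append(chunks[i + 1].chunk_id)
    return chunks


def expand_hit(hit: Chunk, index: dict[str, Chunk]) -> str:
    """At retrieval time, stitch a hit back together with its neighbors."""
    parts = [index[n].text for n in hit.neighbors if n in index]
    return " ".join([hit.text, *parts])
```

The point of the sketch is only that a retrieved chunk can be stitched back into its surroundings instead of being interpreted in isolation.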
Nicolay Gerold: Yeah, and I think my biggest obsession in the space is actually learning patterns, because AI, especially as a field, steals a lot from different areas: from data engineering, from platform engineering, from systems engineering as well. It's like a grab bag, you put it all together, and I think when you actually learn the patterns behind it, you can make way better decisions about how different systems should be put together. And I think that's also partly why I want to do the podcast, because in one and a half hours of conversation, I think you learn a lot from a person.
But I
Daniel Davis: I understand where that's coming from, because I see so much of what I call reinventing the wheel. I see so many people, for instance, building RAG systems when there are so many solutions out there. There are many open source solutions and just as many products. And I go, why? Why are you building it again? Why are you rebuilding it?
And what I learned is that it kind of goes back to what we were talking about a little bit ago: software engineers' jobs are so narrow now that they're doing these projects just to learn. They're doing them because they're so bored in their jobs; they don't get to touch any of these other systems. And a lot of people are recognizing that. They're going, wow, I'm only working on this one really narrow topic every single day, and I want to try some of these other things out.
Then, though, there are some interesting effects of this: now we see all of this code out there that's kind of just a playground. People are playing with ideas, they're experimenting.
And from an AI perspective, we know that at least up until now, some people still think in terms of scaling: pre-training on more and more data is going to solve all of our problems. So they could be looking at this going, well, this is great, all this new code is great. I don't think so. I think this is the garbage-in, garbage-out situation, because now we have more code than ever before, because so many people are doing these experimental projects and it's easier than ever before to do it.
At one time, you would have to do a lot of research on how to use all these tools, and you'd have to manually write the code. Whereas now you're not even having to go to Stack Overflow to copy and paste. You can get an LLM to generate a lot of the code for you, and it's working code. Now, it may not work the way you want it to, but it's working code, so we see more code than ever before being pushed out to somewhere like GitHub.
But if you're trying to train a model to be able to write code, what coding practices do you want to reward? What are the best coding practices? What are the best algorithms? How do you even measure that anymore? We're going to come back to some software engineer having to look at this stuff and make a determination that A is better than B is better than C. And then you go, well, how did you make that determination? A lot of it can come back to personal opinion or personal philosophy. So it's a real challenge, yeah.
Nicolay Gerold: Yeah, but that's actually also an upside of generative AI. Because I can now code much faster, I can generate four different options to solve the same problem, write the different algorithms, and test them.
Daniel Davis: That's true. I don't see many people do that, though. My problem is, I don't think a lot of people test their code. That's my observation. Again, that's an observation, and over time that observation may not have enough supporting evidence. Actually, that would be an assertion. Excuse me, that's an assertion: my assertion based on my observations, which I can't support right now. I don't see a lot of people doing that rigor.
I don't see people really testing, because it's a lot of work to do all that code coverage and all that testing, and it's not fun. But then you get into: are these the most effective ways to do this? So I think you're right. There are people like yourself that see this as an opportunity: wow, now I can quickly do this three different ways and see which way I like best. But I think the more common approach is: the LLM gave me code, I copied and pasted it, I'm on to the next thing.
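As a concrete version of the rigor discussed here, this is a minimal sketch, my addition with hypothetical candidate functions, of how several generated implementations of the same problem could be checked for correctness and rough performance before picking one.

```python
# A minimal sketch (my addition, hypothetical candidates) of the rigor Nicolay
# describes: generate several implementations of the same problem, then check
# correctness and rough performance before picking one.
import timeit


def dedupe_sorted(items):          # candidate A: sort, then scan for repeats
    out, last = [], object()
    for x in sorted(items):
        if x != last:
            out.append(x)
            last = x
    return out


def dedupe_set(items):             # candidate B: set-based, re-sorted
    return sorted(set(items))


CANDIDATES = [dedupe_sorted, dedupe_set]
CASES = [([3, 1, 2, 3, 1], [1, 2, 3]), ([], []), ([5], [5])]

for fn in CANDIDATES:
    ok = all(fn(list(inp)) == expected for inp, expected in CASES)
    secs = timeit.timeit(lambda: fn(list(range(1000)) * 3), number=200)
    print(f"{fn.__name__}: correct={ok}, 200 runs={secs:.3f}s")
```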
Nicolay Gerold: Yeah, I think we could drag this conversation on for another hour, but we have to go to the outro, otherwise I'll take too much of your time. What is next? What can you already tease? What's on the horizon for TrustGraph? What are you guys building?
Daniel Davis: Well, the thing that we're really focused on right now is that our users have asked us about better ways of managing data. People want to be able to control how they load data into the system, granularly select what kinds of topics and data they have loaded so they can unload, reload, and do it dynamically, and to have this very reliable, very modular infrastructure running at all times while controlling what's deployed in it. So they want more granular tools for that, basically infrastructure tools.
And then, though, we want to really dig into what we're calling temporal RAG, which is going to be a big effort. I kind of hinted at how we want to approach it: facts, observations, and assertions. I think we're going to start by trying to create those definitions and keep them
pretty simple at the beginning.
My inclination is to try to keep those simple going forward. There can be pretty sophisticated analytics algorithms underpinning how we calculate things like trust or freshness, but I think we have to try to simplify these kinds of things so that there can be some sort of consensus.
Otherwise, I don't think it's going to drive value. Like I said, I've seen a lot of people chasing the white whales of trust indices or trust metrics, or even risk: the more sophisticated and complex you make it, the less buy-in you get, because people can point to exceptions to every rule. So it's really about starting with a solid foundation of how we want to go through this.
And it is going to require a Bayesian approach because, like I said, there are the before times and the after times. Once you get data into the system, you can start thinking about time differently than for what happened in the past. That's something I think we're just going to have to accept: we can't necessarily understand past information with the level of granularity we will be able to going forward.
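A minimal sketch of what such simple scoring could look like, under my own assumptions rather than TrustGraph's actual design: statements tagged as facts, observations, or assertions carry timestamps, confidence decays with age, and a Bayesian-style update adjusts confidence as new evidence arrives after ingestion.

```python
# A minimal sketch (my assumptions, not TrustGraph's design) of consensus-
# friendly scoring: statements are tagged as fact, observation, or assertion,
# carry timestamps, and get a freshness-decayed confidence plus a Bayesian-
# style update when supporting or contradicting evidence arrives later.
import math
from dataclasses import dataclass
from datetime import datetime, timezone

HALF_LIFE_DAYS = {"fact": math.inf, "observation": 90.0, "assertion": 30.0}


@dataclass
class Statement:
    text: str
    kind: str                 # "fact" | "observation" | "assertion"
    recorded_at: datetime
    confidence: float         # prior belief that the statement still holds


def freshness(stmt: Statement, now: datetime) -> float:
    """Exponential decay by age; facts don't decay."""
    half_life = HALF_LIFE_DAYS[stmt.kind]
    if math.isinf(half_life):
        return 1.0
    age_days = (now - stmt.recorded_at).total_seconds() / 86400
    return 0.5 ** (age_days / half_life)


def update_confidence(prior: float, evidence_supports: bool,
                      likelihood_ratio: float = 3.0) -> float:
    """Bayesian-style update in odds form when new evidence arrives."""
    odds = prior / (1 - prior)
    odds *= likelihood_ratio if evidence_supports else 1 / likelihood_ratio
    return odds / (1 + odds)


now = datetime.now(timezone.utc)
s = Statement("Most people don't test their code", "assertion", now, 0.6)
print(freshness(s, now), update_confidence(s.confidence, evidence_supports=True))
```

The half-lives and likelihood ratio here are arbitrary placeholders; the point is only that the mechanics can stay simple enough for everyone to agree on.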
Nicolay Gerold: Nice. And if people want to start building the stuff we just talked about, or want to build on top of TrustGraph, where would you point them?
Daniel Davis: Well, I don't know if this is backwards or not: trustgraph.ai. That'll point you to our GitHub, and that's where we primarily do everything, in GitHub.
We have our Discord, and we actually have our own YouTube channels where we do tutorial videos, and I talk to other people in the space as well. But yeah, Discord is always a good way to get in contact with us and to talk with other people about some of the things they're working on. We just recently did a poll, and that's why those are the things people were interested in. There were a couple of things I really thought people would be interested in, and so far we haven't seen it. Boy, I've been thinking about taking a multimodal approach to this for a while now, and so far that hasn't been what people are looking for. I'm sure there's somebody out there going, multimodal agentic RAG, yes, I'm interested, but that's not what our users want; nobody voted for it. That surprised me, because it was one of my suggestions.
Nicolay Gerold: So what can we take away if you want to apply this in production?
I think, first of all, building specialized tools rather than a one-size-fits-all solution is a very interesting approach. I think this has been a little bit lost in the ML, AI, and data space as well: in the beginning, tools focus on one specific problem and solve it masterfully, but then they go broader and broader.
And that's okay for a bunch of domains, but I think in a lot of cases it would probably be better if you stick with that area, actually try to master it, and become the default tool, the de facto standard, because you're so good at it. I think this has been a little bit lost: having single-purpose tools, domain-specific applications which focus on one specific problem and the core functionalities, develop clear interfaces for that, optimize the performance, stay a domain-specific solution, and maintain their independence.
With a lot of other tooling and software solutions, a single tool gets overloaded and generalistic. They become like Excel, where in a lot of cases you don't even know what to do in it anymore, it forces a lot of generic data use on top of you, and the integration becomes overcomplicated. So instead of trying to build an all-in-one solution, I think you should think about building specialized tools.
I think this framing is also really interesting for software startups: where does it actually pay to invest in specialized tooling as opposed to adopting something more general? These tech decisions have become more and more complex nowadays, because we have more and more open source solutions out there which are very good for a certain set of problems. The question is whether they align with your specific problem at hand.
As engineers, we often opt into building something in-house, something specialized, but this comes with a lot of cost and complexity down the road: whatever specialized thing you build for the moment also has to be maintained over time, adjusted, and extended with features as the use case evolves. I think this is often ignored, and these are tasks that are often not directly related to the specific problem you're solving for the business or the end user.
Specialized tools and specialized software have a place, but you should be really clear about what their purpose is, why you are building them, and why you can't use something that's already been built by other engineers. Only if the reason is, okay, no one has done it before, or we need something so different to actually solve the problem at hand, do you think about building something in-house.
I also love Daniel's point on robust infrastructure, though I'm always a little bit torn on that. They have built a lot on Apache Pulsar and Cassandra, both very scalable technologies, and I also talked to Daniel's co-founder, who apparently recycled a lot from a previous project. So take it with a grain of salt whether this infrastructure is the right one for you. Yes, it's scalable, but at the same time, they had built with these technical components before.
So I think the best tech stack to build with is probably the one you have built with before, because you already have the experience: you know how to set it up, how to put it together, what the downsides and performance characteristics are. And especially at the early stages, you don't want to spend your time thinking about the technical components and how they work; you want to spend it thinking about the problem and how to solve it. When you go with solutions or technologies you haven't used before, you will spend a lot of time figuring them out and mastering them.
So this is an extra caveat. And then, once you actually hit performance issues, you can start to think about doing a migration and switching to a more powerful tool. At that point, you're likely able to hire more help, or hopefully you have a bit more time on your hands to actually build a better system.
When we look at knowledge graphs, I really like the favor-simplicity advice: keep it simple and keep it modular. Rather than relying on really complex and nested ontologies, like for example schema.org, which has evolved over years, try first to build something simple: design a graph with a few core entities and a few core relationships that capture the essence of the domain, and then add more and more to it over time.
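As a rough illustration of that advice, here is a minimal sketch, with illustrative entity and relationship names of my choosing, of a graph schema that starts with a handful of core types and can be extended later.

```python
# A minimal sketch (illustrative names, my addition) of starting a knowledge
# graph with a handful of core entity and relationship types instead of a
# deep ontology. New types can be registered later without reshaping the data.
from dataclasses import dataclass, field

ENTITY_TYPES = {"Person", "Organization", "Document"}
RELATION_TYPES = {"works_for", "authored", "mentions"}


@dataclass
class Node:
    node_id: str
    etype: str
    props: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    rtype: str


class Graph:
    def __init__(self):
        self.nodes: dict[str, Node] = {}
        self.edges: list[Edge] = []

    def add_node(self, node: Node):
        assert node.etype in ENTITY_TYPES, f"unknown entity type {node.etype}"
        self.nodes[node.node_id] = node

    def add_edge(self, edge: Edge):
        assert edge.rtype in RELATION_TYPES, f"unknown relation {edge.rtype}"
        self.edges.append(edge)


g = Graph()
g.add_node(Node("p1", "Person", {"name": "Daniel Davis"}))
g.add_node(Node("o1", "Organization", {"name": "TrustGraph"}))
g.add_edge(Edge("p1", "o1", "works_for"))
```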
You might also end up at something similar to schema.org; a lot of people have spent a lot of effort figuring out that schema. But you want to build something that works for you, fast, and in a way that you can wrap your head around.
And what I always like to say: a complex system that works has likely evolved from a simple system that worked before. So start with a simple system, and add or integrate complexity over time when it's necessary, not just for the sake of complexity.
So when you look at a new domain, define the core entities and design for flexibility, so keep it extensible. This is also about maintaining optionality when you're building stuff. When I'm thinking about how to structure a project and what to implement when, it's always interesting to ask: how can I set it up so that it's easy to extend, easy to add new functionality or features later? And when you manage to do that, it becomes easy to start with something more simplistic, because you know that when you have to make it more complex, you can do that down the road.
And yep, I think that's it. Let me know what you think of it; I would love to hear your comments and reviews, negative or positive. Also, if you liked it and you've listened for eight minutes now to hear me babble in this post review, leave a review on Spotify or Apple Podcasts or wherever you're listening, and leave a like on YouTube. It helps a lot. And recommend it to your friends, your colleagues, and your students; that's the best way to make it grow. Otherwise, I will catch you with another episode on knowledge graphs next week. See you then.