Nicolay Gerold: Taxonomies, ontologies, and knowledge graphs. At the moment nearly everyone is talking about them, but nearly no one knows how to put them to use. But before we look into that: what are they good for in search?
So imagine you have a large corpus of academic papers, and a user searches for machine learning in healthcare. The system, or the taxonomy, could be used to recognize machine learning as a subcategory of artificial intelligence. But we could also use it to narrow down: we could take the term healthcare and identify that it has sub-fields like diagnostics and patient care. And we can use all of the additional categories or fields we have identified to expand the query, or to narrow it down.
We can then return results that include papers on, for example, neural networks for medical imaging, or predictive analytics for patient outcomes: papers that weren't literal matches for the query, or even closely related on the surface, but that can still be very good results.
We can also use it to filter down and remove papers. For example, we have our category of artificial intelligence, and we remove any paper that is not tagged appropriately, because a paper might just mention the term artificial intelligence in a side note without really being a paper on the topic.
So we are basically building the plumbing, the necessary infrastructure for tagging, categorization, query expansion and relaxation, but also filtering.
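To make the mechanics concrete, here is a minimal sketch of the idea in Python. The tiny taxonomy, the paper records, and the tag names are all made up for illustration; a real system would pull them from a curated vocabulary.

```python
# Minimal sketch of taxonomy-driven query expansion and filtering.
# The taxonomy, category names, and paper records are all hypothetical.

taxonomy = {
    "artificial intelligence": ["machine learning"],
    "machine learning": ["neural networks", "predictive analytics"],
    "healthcare": ["diagnostics", "patient care", "medical imaging"],
}

def expand(term: str) -> set[str]:
    """Return the term plus all of its narrower terms, recursively."""
    terms = {term}
    for child in taxonomy.get(term, []):
        terms |= expand(child)
    return terms

papers = [
    {"title": "Neural networks for medical imaging",
     "tags": {"neural networks", "medical imaging"}},
    {"title": "A survey that mentions AI once in a side note",
     "tags": {"economics"}},
]

query_terms = expand("machine learning") | expand("healthcare")

# Expansion: a paper matches if it is tagged with any narrower term of the query.
# Filtering: papers whose tags never touch the expanded set are dropped.
results = [p for p in papers if p["tags"] & query_terms]
print([p["title"] for p in results])
```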
So how can we build them? That's what we will be diving into today in this episode of How AI Is Built.
So today we have Jessica Talisman with us, who's working as an information architect at Adobe.
And in my opinion, she is the
expert on taxonomies and ontologies.
Let's do it.
Nicolay Gerold: The starting point is the most daunting part for me: staring at a blank page with an overwhelming amount of documents in front of you. In most cases I have no clue how to start.
Jessica Talisman: The way that I'm sequencing it is metadata, then taxonomy. So flat list, to taxonomy, to ontology and schema design, and then from ontology to knowledge graph. Oh, I'm sorry, I skipped thesaurus: it goes taxonomy, then thesaurus, then schema, then ontology.
Nicolay Gerold: What's the metadata
you would actually start with to set
yourself up the best for building
all the stuff down the road?
Jessica Talisman: Think of the metadata as the tags that enter a system when you're tagging content or tagging resources, or assets, whatever we call them. You start with those flat lists. Designing for it can happen in a number of different ways: you can extract using vector databases, or use vectors to determine the groupings and nomenclature. You can also scrape, so there are different approaches. You can also use internal stakeholders and run a metadata request or tag request, or you can take existing labels within a system and use those as your starting point for a flat list. I usually do document extraction, and then the taxonomy process, the bridge from metadata to taxonomy, happens when you have to curate, disambiguate, and structure it as a hierarchy.
Nicolay Gerold: Yeah.
How does the interplay work? Because in the end, the amount of text I could use is endless.
Jessica Talisman: Yeah. And it can be domain specific. Another great starting point for determining metadata is if you can predict your end result and your use cases. For example, if I'm building a system where, because it's a B2B business, it's necessary to align with the outside world (we call those outer-world assumptions), then I'm going to want to reconcile across openly available taxonomies. That would be Google's product taxonomy, GS1, IAB, government taxonomies. And then use that as the starting point.
I have a very interesting exercise that I suggest to people for determining what your starting point is. I will audit the taxonomies and determine what is the same and what is unique or different. It's a pretty exhaustive process, but you start to gain clarity in the application of these industry-wide taxonomies, and then start to apply that: what's valuable and what is not, what aligns and what does not align with your internal project or modeling. Then it helps you leverage what already exists as industry taxonomies, those labels and that structure, while also inserting your unique domain and filling in the gaps.
So that's the second part: you do an audit across all the industry-wide taxonomies, you can scrape or do clustering from your internal vocabularies and documentation, and then you can fill in the gaps of that taxonomy, whatever is not represented. For example, the Google product taxonomy is for ad tech and for e-commerce. That's the use case for that taxonomy.
If you're not in a domain that's specific to e-commerce or ad tech, can you still use Google's product taxonomy? Sure. But you need to model it and retrofit it to apply to whatever the specific domains or use cases are in your environment. And those are called coverage models. Building a coverage model means that you're curating, designing, and building specific to the use cases and needs of your environment and your projected end result, what you want to accomplish or achieve.
Nicolay Gerold: Yeah, and maybe to pick up on the clustering part, because that's something that excites me as well, because we can bring AI into the process. How would you actually approach it? So assume you have a lot of documents and you vectorize them in some way, whether that's TF-IDF or an actual embedding. How do you actually go from the clusters to the actual tags for the clusters?
Jessica Talisman: That's interesting. I use a tool for this, just for low fidelity and easy interoperability and modeling, called OpenRefine. I don't know if you've used that tool, but it's a decent product. If we use that as a starting point and you cluster from within OpenRefine, you can download the JSON clusters from that exercise. There are eight different clustering algorithms within OpenRefine. So that opens up the possibilities without having to be constrained to a black-box type of system, and you can quickly pivot and shape and determine which type of algorithm will work best on the data. Is it going to be Markov? Is it going to be nearest neighbor? What shape does the data have, what do the clusters look like, what's the proximity that you're looking for?
Commonly, if we're going to bridge between clusters: the way you apply clusters, or use clustering to build your initial taxonomy, is that you're looking for groupings where you can obviously model a parent-child relationship, so broad to narrow. You're looking for those inferences and insights from your clustering. From there, not all the labels that you've achieved through clustering can be applied in a taxonomy, because there are rule bases in a taxonomy. You cannot create recursive loops. So synonyms and antonyms are captured not as first-class citizens, but as alt labels or aliases to whatever the main label is that you choose.
It really depends. For the clusters, it depends on the starting data set, what you're working with, your material, the clay that you're molding. It really is like molding clay: being able to determine how the data is behaving, whether with vectors, embeddings, clustering, whatever you're using. TF-IDF is totally fair. But as you know, the algorithm that you choose will behave differently depending on the data sets or documents that you're starting with, and on how robust they are.
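As a rough illustration of the workflow described here, the following is a minimal scikit-learn sketch that clusters a handful of hypothetical documents into candidate groupings. It stands in for the OpenRefine step; the documents, the cluster count, and the algorithm choice are assumptions for the example, not a recommendation.

```python
# Minimal sketch of clustering documents into candidate term groupings
# with scikit-learn, as one stand-in for the OpenRefine workflow described
# above. The documents and the number of clusters are hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "neural networks for medical imaging",
    "deep learning in radiology",
    "predictive analytics for patient outcomes",
    "hospital readmission risk models",
]

# Vectorize with TF-IDF; embeddings would slot in the same way downstream.
X = TfidfVectorizer().fit_transform(docs)

# The algorithm choice (k-means here, versus nearest-neighbour or
# key-collision methods in OpenRefine) should follow the shape of the data.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

clusters: dict[int, list[str]] = {}
for doc, label in zip(docs, labels):
    clusters.setdefault(label, []).append(doc)

# Each cluster is only a candidate grouping; a human still has to pick the
# preferred label and decide what becomes an alt label or gets discarded.
for label, members in clusters.items():
    print(label, members)
```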
Nicolay Gerold: So basically you first look at the shape of the data by doing some form of visualization, and then figure out: how is the data structured? Do you have clear clusters, or is it more of a fuzzy type of relationship?
Jessica Talisman: Fuzzy, exactly. And fuzzy is not ideal. That's where it's important to play with lots of different clustering algorithms, because my one concern with vectors is that, unless you build your own clustering or vector environments, you normally go with whatever comes out of the box. Oftentimes you don't get to select the algorithm, but oftentimes you can dial it in according to TF-IDF, term frequency, and accuracy.
Nicolay Gerold: Yeah.
And have you ever tried
using node2vec as well?
Jessica Talisman: Node2vec, yes. And I find that when you do that, you have to have a starting structure or taxonomy. It doesn't do well trying to pick up or determine structure on its own. The same goes for text-based approaches like text2sql: you start running into complexities if you don't have a determined structure, like a three- or four-level hierarchy and a framework. Oftentimes it doesn't work as well unless you have that starting point.
Nicolay Gerold: Yeah. And I think the challenging part, why so many people actually get taxonomies wrong, is the rules that you mentioned. What are some of the other rules for taxonomies that aren't so well known?
Jessica Talisman: I think one of them is that there's an ISO, and it has actually turned into guidance rather than a strongly stated, make-or-break standard. But I will tell you that the guidance is make or break. It has to do with information retrieval systems, and it has to do with what I call the basement of the Internet. The big heavy hitters, the search engines like Google and TikTok and whatever else, do follow some of these rule bases.
ANSI/NISO Z39.19 is the standard for monolingual controlled vocabularies and thesauri. That is one of the main standards you want to follow when designing taxonomy systems, and it declares the application of SKOS, the Simple Knowledge Organization System, which is an upper ontology. So it's still an ontology, it's a schema, but it programmatically encodes a hierarchy, a taxonomy. And the rule base there is that a primary label, or preferred label, which is your main label in the taxonomy, cannot also be an alt label or an alias for other labels in the taxonomy. That is very important, because that's where we create recursive loops or self-referential data.
The other main point of that standard is disambiguation. We think it doesn't matter, but if you have a tag with no definition, or a taxonomy node with no definition, that's simply content. A system is going to have a very hard time making sense of a really global, ambiguous piece of content that has no context and no meaning, that isn't given context by surrounding it with children and parents, so to speak. You have to have that sort of modeling in there. Out of the gate, we think of a taxonomy as just structuring tags or labels; that's the normal take on it. But in fact it's your starting point towards knowledge graphs, and that disambiguation and those definitions are absolutely essential. So if you think of your taxonomic structure or hierarchy, and you add text definitions, imagine how much more richness you're introducing into your ecosystem, moving towards that context and meaning that we all seek to achieve with knowledge graphs and AI. So that's very important.
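A minimal rdflib sketch of those SKOS rules, assuming a made-up namespace and two example concepts: one preferred label per concept, synonyms as alt labels, a definition for disambiguation, and an explicit broader/narrower hierarchy.

```python
# Minimal sketch of encoding the rules above with rdflib and SKOS.
# The namespace URI, concepts, and definitions are made up for illustration.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import SKOS, RDF

EX = Namespace("https://example.org/taxonomy/")
g = Graph()
g.bind("skos", SKOS)

ai, ml = EX.artificialIntelligence, EX.machineLearning
for concept in (ai, ml):
    g.add((concept, RDF.type, SKOS.Concept))

# One preferred label per concept; synonyms live in altLabel and are never
# another concept's prefLabel, which is what prevents recursive loops.
g.add((ai, SKOS.prefLabel, Literal("Artificial intelligence", lang="en")))
g.add((ai, SKOS.altLabel, Literal("AI", lang="en")))
g.add((ai, SKOS.definition,
       Literal("Systems that perform tasks normally requiring human "
               "intelligence.", lang="en")))

g.add((ml, SKOS.prefLabel, Literal("Machine learning", lang="en")))
g.add((ml, SKOS.broader, ai))   # broad-to-narrow hierarchy
g.add((ai, SKOS.narrower, ml))

print(g.serialize(format="turtle"))
```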
Nicolay Gerold: It reminds me so much of Joe Reis, to be honest. Like all the data modeling: each column has an owner, each column has a definition, and each column can be traced back to the documents or to the source of the data.
Jessica Talisman: And there has to be parity. So you always want to update and check your structure, your taxonomy, because as it grows, as you add to it, you still have to continue to validate and make sure that you're not introducing ambiguity into the system. That's the whole point of modeling a taxonomy: to remove that ambiguity for machines and people.
Nicolay Gerold: Yeah, and a big pet peeve of mine is actually trying to remove stuff first. Whenever I'm starting a new implementation or a new project in an existing firm or an existing system, I first look at the stuff I can actually remove. How often do you try to integrate removal loops into the general workflow, where you actually look at the system and ask: how is my current taxonomy looking, what can be removed?
Jessica Talisman: I do. I really encourage it. It's something that comes from library and information science that's called weeding. That's what we've always called it: you move through a collection of any sort, including your databases, and you remove noise. Removing noise is the same thing throughout all systems and all domain practices. It's really important to curate and care for your data set, your taxonomies. But another part of this is that we also run into a problem when we weed too much and eliminate signals that are essential. What I mean by signals are those taxonomy nodes that not only disambiguate, but help to balance your data.
And this is critical: you can have your closed-world assumption, which is how you model your own internal domain, but you always want to keep your eyes, your vision, set a little more broadly, to pick up signals from the outer world. Those could line up with competitors, or with testing the temperature of social interactions around industry-related concepts that may not immediately touch your internal domain. Those signals are instrumental in helping to curate your taxonomy, because we deal with industry trends all the time. And in order to pick up on those conversations: you don't know what you don't know. What I mean by that is, if you keep your world so cloistered that you're only modeling what exists in your internal system, you're never prepared. You're never prepared to catch on to the signals indicating that the market is shifting, or the industry is changing, or there's this new technology or this new concept.
Nicolay Gerold: Yeah, and you already mentioned scraping taxonomies. How does your process look when you're going to competitors? Most won't have their taxonomies online. How do you actually get insight into how they are structuring their data and building their taxonomies?
Jessica Talisman: I only go to taxonomy downloads, where the taxonomy is actually surfaced. I would never go to a browse menu to determine a company's taxonomy, because that's not necessarily the taxonomy; those are display labels that you're seeing. We tend to enter this realm of confusion when we assume a taxonomy is a browse menu or a site map. Oftentimes those only represent what's required for indexing a site, or simply for the UX and UI. You can scrape for those types of labels within the code base, that's fine, but again it's very difficult to infer from, because most enterprises have redirects and backlinks and things like that just to make the site operate. So I tend to go to the actual downloads of those files, and in those cases it's just spreadsheets.
And when you're looking at these spreadsheets, I usually very quickly go into OpenRefine to cluster and sort and look for trends, or any sort of insight into how the data has been modeled to form a taxonomy, because not all taxonomies that you find online are going to be validated and legitimate. That's just the truth. So you have to do your own work. That's why it's dangerous to use openly available taxonomies. The other problem with openly available taxonomies is, sure, you can validate them in the system, but again, always start from the original use case.
If it's a product taxonomy like Facebook's or Google's, those are for e-commerce, advertising, and marketing. So they're going to have huge gaps in coverage, and you may not realize that those gaps exist. So you want to compare it to whatever you have internally, even if it's a flat list, and determine where those gaps are in that industry taxonomy. If you use, for example, the IAB taxonomy, that's for marketing, content, and advertising: media, advertising, marketing, news, that sort of thing. Understanding and actually doing your homework about what the use case is, what the intended use or application of the taxonomy is, then starting from there and being very clear about what your own use case is. Then you're looking for gaps in coverage.
Nicolay Gerold: And you
already mentioned the use case.
Let's assume you already have created
a taxonomy for a specific use case.
How do you actually know
that the taxonomy is working?
Jessica Talisman: Ah. Usually that has to do with things like TF-IDF: you want to measure relevancy and accuracy. It depends on what the application of your taxonomy is. For example, if you just have tags on content, you can obviously measure the performance of the content, whether from an SEO perspective, which is a rudimentary way to measure. Findability is a huge one: how findable is the content? You can run tests around that. Outside of the SEO use case, you're looking for things out of the gate like tag completeness. Have you achieved tag completeness on whatever the content or the data objects are? The other thing that is a dependency, and this is why it's a hard question to answer, is the implementation of the taxonomy. It may be related to representing, for example, data objects in a database. That's why metadata and schemas are important. We're not even at ontologies yet; we're looking at metadata and schemas.
Does your database, your data ecosystem, have a schema registry and validation implemented to govern and help enforce the metadata that has been extracted? For example, you might have a schema with required fields, and you may have topic tags that are derived from the taxonomy, with a requirement that you have at least one tag from the topics in your taxonomy and no more than five. Then the data objects exist, and you can consistently say that the application of those taxonomy nodes within the database is complete.
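As one way to picture that kind of governance, here is a minimal sketch using the jsonschema library. The field names, the allowed topics, and the one-to-five-tags rule mirror the example above but are otherwise hypothetical.

```python
# Minimal sketch of enforcing "at least one topic tag, no more than five,
# drawn from the taxonomy" at the schema level. Field names and allowed
# topics are hypothetical; jsonschema is just one possible validator.

from jsonschema import validate, ValidationError

ALLOWED_TOPICS = ["machine learning", "diagnostics", "patient care"]

asset_schema = {
    "type": "object",
    "required": ["id", "topics"],
    "properties": {
        "id": {"type": "string"},
        "topics": {
            "type": "array",
            "items": {"enum": ALLOWED_TOPICS},  # only taxonomy nodes allowed
            "minItems": 1,
            "maxItems": 5,
        },
    },
}

asset = {"id": "doc-123", "topics": ["diagnostics"]}

try:
    validate(instance=asset, schema=asset_schema)
    print("tagging is complete and within bounds")
except ValidationError as err:
    print("metadata violation:", err.message)
```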
The thing about taxonomy is that it's infrastructure. It's the plumbing, it's the pipes. And so oftentimes the way that you measure the success of a taxonomy is secondary. You're not necessarily going to achieve bold-faced, completely apparent results right out of the gate. There are query times: when you're looking across a dataset and querying a SQL database, how long does it take you to find results? Are those results accurate? Are they complete? You're looking for ambiguity. You're looking for issues with findability throughout a company data catalog. There are so many different levels. You can make connections or integrations with B2B customer data sets and databases more seamless. You can achieve higher customer satisfaction. So there are so many different ways you can slice and dice it. Without knowing the nuances of a system and, again, the use cases, it's really hard to design those success metrics.
Nicolay Gerold: Yeah, and I think for me the most challenging part, among others, is actually coming up with clear and intuitive labels. In a taxonomy this is already a problem in classification cases; especially nowadays, when most people just use an LLM for classification, the labels carry even more weight. The labels already have a massive impact on the people doing the work of creating training data, and in taxonomies that is supercharged even more. How do you actually find the balance? Because the goal would be MECE labels, mutually exclusive and collectively exhaustive, but that's never completely doable. So how do you approach that problem and decide which labels are the right ones, and which synonym to pick as the main label?
Jessica Talisman: Yeah, so you usually want to use sources, internally and externally, to determine the right label. I usually go to large-scale knowledge graphs like DBpedia and Wikidata (not Wikipedia, but Wikidata). You can access their ontologies and taxonomies pretty openly and publicly. And again, you determine what most people in the industry are using for those labels, and you apply definitions. That disambiguation process is critical. And then you can harvest from those open source knowledge graphs or taxonomies: you can harvest the preferred label by way of volume, or consistency, or parity between the systems. If everyone in the industry calls content "docs", you might choose docs over content, because that's what the world calls it. However, you have to mitigate and negotiate between internal systems and external ones. The most important thing, and your most critical role in shaping what those labels should be, is definitions and disambiguating.
So never have "miscellaneous", "other", those sorts of labels. Miscellaneous and other are not well defined; it's a junk drawer. Oftentimes you'll see those surface only because you're starting with a base data set, maybe your own, where miscellaneous and other are used throughout, and so they come back looking like highly regarded, necessary labels. That's not necessarily the case. Eliminating ambiguity means everything must have a defined label. Sure, you could say miscellaneous or other, or "we haven't decided yet"; maybe that's your definition for it, like "we're not sure". But if you think about it from a practical standpoint, that should be your answer: if you're defaulting to those ambiguous labels, then maybe they shouldn't be in there.
The other part is that you're pushed to be considerate and intentional about how you label things. Another kind of ambiguous label could be "accessories"; that's a great example from the e-commerce space. Accessory to what? So you disambiguate things that do have actual constructive labels but don't make sense as a standalone or without a definition: are they clothing, like women's accessories? How are you going to define or curate that further? And always think about it like a grocery store: if you go to a grocery store looking for something specific, spices, say, you rely on those labels. So be as specific as you can, and know that with the labels you apply, you're helping someone navigate an environment, much like a grocery store or an online store. Determining what else something could be called, you can do by, again, scraping, or you can create algorithms, or take an OpenRefine knowledge graph approach. The reason I like OpenRefine is that there's an RDF plugin, which gets you closer towards graph, and there's schema reconciliation. So you can connect OpenRefine to these large open data sources, create a schema in there and then reconcile against that schema, and it will pull up all the results, and then you can actually gather the synonyms from that exercise.
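A minimal sketch of harvesting candidate synonyms from Wikidata's public SPARQL endpoint with SPARQLWrapper, as one programmatic counterpart to that reconciliation step. The QID below is an assumption used purely for illustration; in practice you would use whatever entity your reconciliation returned.

```python
# Minimal sketch of harvesting candidate synonyms (altLabels) for a concept
# from Wikidata's public SPARQL endpoint. The QID is an assumed example.

from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql",
                         agent="taxonomy-sketch/0.1 (example)")
endpoint.setQuery("""
    SELECT ?altLabel WHERE {
      wd:Q11660 skos:altLabel ?altLabel .   # Q11660: assumed QID for "artificial intelligence"
      FILTER(LANG(?altLabel) = "en")
    }
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
synonyms = [row["altLabel"]["value"] for row in results["results"]["bindings"]]

# These become candidate altLabels; a human still decides which term is the
# preferred label and which ones are aliases.
print(synonyms)
```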
Nicolay Gerold: Yeah. And in the end, what part of the taxonomy is exposed to the user varies a lot. To pick up on the supermarket example: spices is a sensible level of granularity, but you could go even more fine-grained. How do you actually decide at which level of abstraction to start and end the taxonomy?
Jessica Talisman: Oh, that's interesting. One of the risks in taxonomy building is mixed granularity; that's a pitfall in taxonomy design and building. Mixed granularity is when you have an extreme level of granularity at the bottom node, the leaf, but other leaves are broader. A machine tends to gravitate towards the most granular, which then makes it difficult to reconcile the broader ones, and you end up with a very imbalanced data set that can be confusing to machines as far as the tasks or jobs to be done. So I always err towards making the first iteration broad on purpose. Spices would be the lowest level; I haven't gone into garlic powder or pepper or anything like that yet. I want to see how this behaves in the wild and where the gaps are, because it's much easier to add, to build upon, to scale, than it is to take away.
The reason it's harder to take away is that people, for the most part, are going to be the ones operationalizing those tags, and you'll have human behavior. Some people will gravitate towards being super granular; others will say, it's fine, we'll just do the broadest possible, like grocery store, and tag everything with grocery store, or whatever it is. So it's easier to determine by volume and need and workflow, and then add granularity as needed. Because the type of granularity you're looking for may actually be attribute-dependent: you may be looking for spices by attribute, so your spices might be spicy spices or whatever it is. That's still global, but a little bit narrower. So really look at it, because if your ultimate goal is to continue to scale and grow, and possibly achieve a thesaurus, a sort of NLP-activated type of environment, or move further into ontologies, the dependency is that primary modeling that you do with your list or metadata, which then becomes your taxonomy. So those iterative approaches, all those design decisions, feed into the later iterations and scaling of your information system.
Nicolay Gerold: Yeah, and in my opinion e-commerce companies should have the best taxonomies, because they have it easiest to link them back to search behavior and user behavior.
Jessica Talisman: Yes. Yes. And something that's interesting is that I don't think I've come across an e-commerce platform that doesn't use entity-attribute-value models, because it's so attribute-dependent. And marketing cycles, sales cycles, holiday cycles, and seasonal cycles are so fast that that's an instrumental part of how taxonomies are utilized and implemented. That would be a case where too much granularity gets you in trouble, in the e-commerce example, because the attribute might be related to a season, like summer, winter, fall, spring, or it may be relevant to a holiday, like Christmas or Hanukkah or whatever e-commerce companies tend to leverage at that point, to get to that granularity.
Nicolay Gerold: Yeah, and you have so many interesting tools, like head, torso, and tail curves, which you can use to find what the top levels are, especially in the mind of the user, which is really interesting. How do you actually differentiate between the taxonomy of experts and the taxonomy of the end user? Is it sometimes necessary to build two different taxonomies, like a user-facing one?
Jessica Talisman: Oh, yes. Yes. A display taxonomy is what that's called. And maintaining one is also useful, because say your customers include brand and marketing: internally you have some less technical stakeholders as well who request or require the display label, for the sake of the customer, to be just so, because branding and marketing require that. That negotiation between the customer-facing display labels or taxonomy and the back end is necessary, because it's important to have stability on the back end. Branding and marketing should not be deciding or determining what lives on the back end, nor should customers. Having the stability of those labels on the back end from a taxonomy perspective is really important, and then you let branding and marketing decide on the display labels.
Nicolay Gerold: And then the two taxonomies are basically connected through some kind of mapping?
Jessica Talisman: So you can map them with equivalency statements. Using something like SKOS or OWL, you can simply use sameAs from an ontological perspective and it draws parity between them, or you can use crosswalking, which basically uses the same ontological or schema-based elements to draw parity between the two taxonomies.
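A minimal sketch of such a crosswalk in rdflib, mapping a hypothetical back-end concept to a display concept. skos:exactMatch is used here as the softer equivalency between two vocabularies; owl:sameAs would assert full identity.

```python
# Minimal sketch of crosswalking a stable back-end taxonomy node to a
# customer-facing display label with SKOS mapping properties.
# URIs, concepts, and labels are made up for illustration.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import SKOS, RDF

BACK = Namespace("https://example.org/taxonomy/")   # stable back-end taxonomy
DISP = Namespace("https://example.org/display/")    # marketing-facing labels

g = Graph()
g.bind("skos", SKOS)

g.add((BACK.outerwear, RDF.type, SKOS.Concept))
g.add((BACK.outerwear, SKOS.prefLabel, Literal("Outerwear", lang="en")))

g.add((DISP.coatsAndJackets, RDF.type, SKOS.Concept))
g.add((DISP.coatsAndJackets, SKOS.prefLabel, Literal("Coats & Jackets", lang="en")))

# Equivalency statement: the crosswalk lives here, so branding can rename
# the display side without touching the back-end node.
g.add((BACK.outerwear, SKOS.exactMatch, DISP.coatsAndJackets))

print(g.serialize(format="turtle"))
```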
Nicolay Gerold: What are the best taxonomies you've actually seen so far in the wild?
Jessica Talisman: Some of them are a little bit outdated, but if I think of Amazon's browse nodes, if you can find them, that's a pretty interesting taxonomy and throws a little wrench in the plans of Google and Facebook. It's an interesting alternative view, so I appreciate Amazon's browse nodes a lot; that's a pretty solid structure. And I'm partial to some of the government taxonomies, like the EU taxonomies for jobs, employment, and domain expertise. I think those are pretty interesting. I haven't been completely wowed by any, because once you get into certain taxonomies, they can be so vastly different that it's apples and oranges; it's hard to draw equivalencies. The designs and implementations differ: the GS1 taxonomies, for example, are relative to the brick schema and are very attribute-heavy. So the construct of those taxonomies and, again, their intended use cases can be so vastly different that it's hard to draw equivalencies and say, yeah, this one's the best. But I would say check out Amazon's browse nodes, and second to that would be the IPTC taxonomy. I appreciate that one. It's not perfect and there's some ambiguity, but it's pretty good.
Nicolay Gerold: I'm wondering about operating in a tangential space. So, for example, I want to create a taxonomy of industries. Can I also use a related taxonomy, for example of job types, to bootstrap a taxonomy, if I can't find one for the specific thing I'm looking for?
Jessica Talisman: Yes, you can. So there is a sort of merging, you mean merging the two taxonomies? Say you have an industry taxonomy. You would have to disambiguate those data sets and, again, go through that gap analysis and the coverage model approach, where you're finding the commonalities, the equivalencies, within or between those taxonomies, and the gaps, because you're going to want to account for the gaps. The reason I say that is that when you look at how industries are defined, there can be huge gaps in those definitions. I actually just modeled this; I think we chatted about it offline. With the gaps that you account for, you will find that some industries are modeled to be very high level and align with the stock market. That's one good example of how the industry gets defined there. Others are defined alternatively by commerce. Those are very different views, different ways of modeling or structuring. One taxonomy, and this is just an example, IPTC I think, calls it arts, media, and entertainment. That's the industry. Another taxonomy that I've been working with just calls it entertainment. And then yet another taxonomy, an industry taxonomy, has media and entertainment. So those are three parallels. But do you see how arts, media, and entertainment would be the highest level, and then the next level down, if you're combining the three, is media and entertainment, and the child of that becomes entertainment?
Nicolay Gerold: Yeah.
Jessica Talisman: I wasn't expecting the balloons, but yeah, it's something to celebrate. So you have to be able to iteratively account for those, and that's where clustering helps, because, depending on the type of algorithm you're using, the clusters will often appear with all three: arts, media, and entertainment. Those cannot all be at the same level; logically it doesn't work.
Nicolay Gerold: And you mentioned to me that you're a big fan of SKOS especially. What makes SKOS different from other ways of approaching taxonomy building?
Jessica Talisman: So other ways would be, for example, a spreadsheet, maybe representing it as OWL; that's the other schema I can imagine. Or you could just use primitives in JSON; that's just parent-child structuring, and that's fine. But SKOS has a validation framework that will catch things such as recursive loops or self-referential data, which obviously is not ideal or optimal. We know that recursive loops are not great. And relationship clashes, that's the other thing; relationship clashes are an issue within a system. So I like SKOS because it's supported by a native, intuitive validation framework, which is helpful. That helps to test how your taxonomy is going to behave with machines.
That's really important. The other thing is that it keeps you honest in the structuring. You may not love it, but it's going to give you that feedback throughout, as you model. It's the MVP, the minimally viable product, as far as ontologies go, because you're modeling with an ontology as soon as you start with SKOS. It's W3C compliant, which is pretty great, so you know it's going to be machine readable as well, based on the specs. And SKOS-XL is the extension to SKOS that makes it compatible with OWL. So if you're wishing to get towards a knowledge graph, SKOS will model, or shape, your taxonomy to be graph-ready; you've already started your work towards the knowledge graph naturally by doing that. So it makes it machine readable, it gets you closer to graph, it has the validation framework, and it's ISO compliant.
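A minimal sketch of the kind of checks such a validation setup runs, written by hand against an rdflib graph: preferred labels reused as another concept's alt label, and cycles in the broader hierarchy. The file name is hypothetical; dedicated SKOS validators cover far more than this.

```python
# Minimal sketch of two SKOS sanity checks: prefLabel/altLabel clashes and
# recursive loops in skos:broader. "taxonomy.ttl" is a hypothetical file.

from rdflib import Graph
from rdflib.namespace import RDF, SKOS

def label_clashes(g: Graph):
    """prefLabels that also appear as altLabels on a different concept."""
    pref = {lbl: c for c, lbl in g.subject_objects(SKOS.prefLabel)}
    return [(pref[lbl], c, lbl)
            for c, lbl in g.subject_objects(SKOS.altLabel)
            if lbl in pref and pref[lbl] != c]

def has_broader_cycle(g: Graph, concept, path=frozenset()):
    """True if following skos:broader from `concept` ever loops back."""
    if concept in path:
        return True
    path = path | {concept}
    return any(has_broader_cycle(g, parent, path)
               for parent in g.objects(concept, SKOS.broader))

g = Graph().parse("taxonomy.ttl", format="turtle")
concepts = set(g.subjects(RDF.type, SKOS.Concept))

print("label clashes:", label_clashes(g))
print("cycles at:", [c for c in concepts if has_broader_cycle(g, c)])
```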
Nicolay Gerold: Yeah, I have two areas to go into; let's start with the first, which is not really related. We mentioned Europe before, and in Europe one issue is that there are many languages you actually have to support. How do you take a taxonomy and make it multilingual? Because you cannot really translate the labels one to one; it just won't work.
Jessica Talisman: You want to incorporate or integrate localization standards within your taxonomy. The same ISO standards that you use for localization and globalization you can apply, and SKOS is compatible with them. That's the other great thing about SKOS: you can include those multilingual labels and a translation layer, as long as you line up with either the ISO two-letter or three-letter language codes. And I know Google Places comes into purview, so you don't need an exact physical address, for example, when you get into geolocation-type work. But the idea is implementing those ISO standards, specifically the ones for globalization and localization, language and country, with your implementation of SKOS. The two are symbiotic, and you can support automatic translation of the labels. Now, choosing what those labels are: I think that EuroVoc has done a pretty good job of incorporating that. It is tricky, but I've been successful in using SKOS and implementing the ISO standards for language and country.
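A minimal sketch of multilingual labels in SKOS with rdflib, using ISO 639-1 language tags. The concept and the translations are invented for the example.

```python
# Minimal sketch of language-tagged SKOS labels: one preferred label per
# language tag, all hanging off the same stable concept URI.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import SKOS, RDF

EX = Namespace("https://example.org/taxonomy/")
g = Graph()
g.bind("skos", SKOS)

g.add((EX.spices, RDF.type, SKOS.Concept))
# SKOS allows at most one prefLabel per language tag, so each locale gets
# its own preferred label while the concept URI stays the same.
g.add((EX.spices, SKOS.prefLabel, Literal("Spices", lang="en")))
g.add((EX.spices, SKOS.prefLabel, Literal("Gewürze", lang="de")))
g.add((EX.spices, SKOS.prefLabel, Literal("Épices", lang="fr")))

# Pull the German label back out for a localized front end.
german = [lbl for lbl in g.objects(EX.spices, SKOS.prefLabel)
          if lbl.language == "de"]
print(german)
```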
Nicolay Gerold: Yeah. Now, if we disregard the work volume and focus only on getting the best or highest quality taxonomy: assuming we have German and English, would you build the two taxonomies separately and then merge them, or would you build one taxonomy and then translate it? What would be your approach?
Jessica Talisman: I usually do it as I go. For example, I'm about to start building a large taxonomy, and granted, I'm using TopQuadrant, so we have enterprise semantic middleware that sort of has all of this baked in; you have to select what you want baked in, in terms of your settings. But to your point, I'm building from SKOS, I have the ISO for language and country, so that's done, and then it automatically translates the labels as I'm building. So that's the ideal: you account for it while you're building. Now, in some instances and use cases, or depending on where you work or who you work for, that's required to go through another team.
I'm thinking of Amazon. When I worked at Amazon, or even here at Adobe, where I am right now, you have a local team. They're going to want to review and approve, and oftentimes finance and legal can be involved as well for approval of those labels. That's why, getting back to an earlier conversation, it can be dangerous to have AI or vectors completely create these classification schemes for you, because oftentimes, for legal reasons, those localized, globalized, or translated labels have to go through some approval process. But if you line it up with industry standards and rely on something like DBpedia to help harvest those synonyms and primary labels, those systems have already been localized and globalized. You can download any language you want from any of those systems. The same is true if you want to work towards a thesaurus: you might use WordNet, or BabelNet. There are some beautiful knowledge graphs and a lot of work that has already happened in this space. It really depends on how much longitude and latitude you have for making executive decisions.
Nicolay Gerold: Yeah, and I think a lot of applications of taxonomies are search-related, whether that's tagging or filtering out documents after the fact. What are some of the more creative ways you've seen taxonomies applied in industry?
Jessica Talisman: Yeah, so this was an interesting one. At a former job, a content-driven company, we had an author compensation model. They used the taxonomy, with an algorithm they built, to detect whether the compensation patterns and results, so how much they were paying authors, were accurate. It turned out that over five years they would save over 8 million, because they had been overpaying authors. So they were able to use the taxonomy that I built to detect overpayment, and it had a huge impact on the bottom line. Another one is... oh, sorry, go ahead.
Nicolay Gerold: How did that actually look in practice? How was the mapping between the taxonomy and the overpayment created?
Jessica Talisman: Ah, the nuances of it I can't really talk about, but I will tell you that they used the taxonomy from a topical perspective. Once the content was tagged, the author related to that piece of content was being overpaid because there were so many duplicate labels or near-synonyms within the system, and the payment had to do with the coverage of the content the person had authored. When my new taxonomy collapsed a lot of those labels and accounted for them as synonyms (Microsoft Azure is the same as Azure), someone was no longer being paid twice for teaching a course on Microsoft Azure, right? So by just cleaning that portion up and having a realistic view, a coverage model, they were able to refine and collapse the categories from which they were paying. So that was helpful.
Nicolay Gerold: Yeah.
The way it lives in my brain: basically, based on the tags, you could establish something like a market size or an audience size,
Jessica Talisman: Yep.
Yep.
Exactly.
Nicolay Gerold: which you can then tie the compensation to.
Jessica Talisman: Exactly. And if their royalties are related to that, that's exactly it. The maturation of that exact framework was the royalty model: there's the base compensation and then royalties. So that was a very interesting application, and a huge savings for an enterprise that's content-driven.
I think a lot of us know about the EU green taxonomy. That's a pretty interesting one: the green taxonomy enforces a company's compliance on zero carbon and gives tax credits and kickbacks for complying with, or meeting, the idea of greening your business. So that's for governments as well. And taxonomies are often used with cybersecurity or security systems; the NICE framework, or taxonomy, is an interesting application to catch bad actors.
There was a really cool taxonomy, and I happened to get caught in it unknowingly. I didn't get my US tax refund a few years ago, and when I called about it, it turned out they needed to verify my identity. I had to go through all these processes, and they were doing a beta test: the US government, the US Treasury, was beta testing a new taxonomy to catch bad actors who were trying to get your tax refund by filing your taxes before you could. I happened to be caught up in it, where the taxonomy was trying to capture bad actors. The idea was to sequester users based on certain parameters that would put them into this grouping, and it was because I had moved addresses three times in a couple of years; I think that was the trigger. So taxonomies are very useful for catching bad actors in systems as well, and have been used that way for years.
Nicolay Gerold: Yeah, that's really interesting. I always tend to ask about underappreciated and overrated. What technology, in your opinion, is underappreciated, and what is overrated at the moment? Especially on the overrated part, I'm really excited for your answer.
Jessica Talisman: The overrated... I mean, if I say AI: I think that AI, to a certain extent, is overrated just in how it's been marketed. Text to SQL. I know I'm going to sound like a sour person on this, but I see a lot of applications or implementations where we try to skip the necessary steps to structure data. Text to SQL can be successful to a certain extent, but I haven't seen it scale well yet. I want to see it scale well, but I haven't, because we're trying to impose structure, like relational database rules, onto things like property graphs or AI; we're trying to force structure upon something that inherently doesn't have the structure required for AI.
Nicolay Gerold: Yeah.
To be honest, I think the term AI
has completely lost its meaning.
Jessica Talisman: Oh, it has.
Nicolay Gerold: Even in the AI Act, with the definition they placed in the first draft, everything from an if-else condition upwards could have been AI. And I think in the hype cycle at the moment, we are way past expectations.
Jessica Talisman: Way past expectations, and lots of bets placed around it. It feels like, coming out of COVID, it was easy to grab onto this holy grail, because the industry expanded so much that, as it contracted, AI was a soft landing in some ways, hopeful and promising. But the other thing is that with AI we've abandoned these core principles of data management and think, oh, we're breaking free from these shackles and we don't have to do any of this anymore; we don't have to curate and care for our data, AI will just do it. So I think the negotiation with humans is part of the issue, and then also realizing the actual capacity of AI: it's algorithms, it's statistical reasoning. We tend to look at it and expect it to be an oracle, and it's not that.
Nicolay Gerold: And on the other side, underappreciated, but I will force you to not say taxonomies.
Jessica Talisman: Forced me not to say taxonomies. Something I'm turning to beyond that: I would say knowledge graphs are gaining some appreciation, but ontologies, I do think ontologies are underappreciated. Even though ontologies are required to create knowledge graphs, and that's the fact I'm sticking with, there are lots of other applications for ontologies, such as structuring taxonomies. So can you have a taxonomy without an ontology? Sure. But how powerful and meaningful will that taxonomy be if it's not modeled using an ontology? I think the pairing of the two, but also the application of ontologies to help structure and govern taxonomies outside of a knowledge graph, just the application of an ontology on its own, is something that's not appreciated. We tend to want to skip to the next shiny, cool object, and that happens to be knowledge graphs. That's the other thing in the center of the tech radar, alongside AI. And the truth is that you can't get to a knowledge graph without taxonomies and ontologies.
Nicolay Gerold: And if you could
force one belief on every AI person
in the world, what would it be?
Jessica Talisman: I think that although we want to give all the jobs to be done to machines, we still need humans to shape the data, define the data, curate the data. Because if we want to get the results we're expecting as the end result, then we have to have a hand in how we represent that data, period. So taking an out-of-the-box taxonomy, for example, may be great, but you're going to find lots of gaps that you don't even know exist, and you're going to build up more tech debt than it's worth. Taking the time up front to really shape things properly is going to make a huge difference in your implementation and the results that follow.
Nicolay Gerold: Yeah, and if people
want to follow along with you, hire
you for building a taxonomy, or
stay up to date with your upcoming
book, where can they do that?
Jessica Talisman: LinkedIn is really the best place right now. I actually have something upcoming: IA for AI, which is a conference that we're doing through Swarm Community on October 10th and 11th. I will be co-hosting that, and I have some pretty interesting interviews lined up, including Kurt Cagle, who's amazing; you may have seen his articles around AI and structuring data. That's coming up, and then I'm going to be presenting on a panel at Connected Data London in December, and then at Data Day Texas. So I'll keep posting here and there. I'm deep in build mode right now, which usually means my brain is busy building and less on LinkedIn. But I'm happy to chat with anyone interested, to hear any feedback or comments and continue the conversation, because I think dialogue is where we really evolve a lot of these ideas and share across domains.
Nicolay Gerold: So what can we take away? First of all, whether you like it or not, you're working with some kind of taxonomy in some form, whether that's in classification or in search, with your tags, your filtering, and your query expansion. And I think her point that humans will always be needed to shape the data in some way will hold true, or should hold true, to a large extent, because we still have to give the models a boundary for the system in which they are working.
It's similar to what Joe Reis is doing with the whole data modeling topic: you should spend a fair amount of time on the data model, on how the data looks, and this is very application-dependent. Whatever you're doing, the classes, the different tags you're using, the taxonomy, the thesaurus, or the knowledge graph, is very application-specific in the end. And there's probably no one who can really build it besides you, or besides the subject matter experts in your company. So you should probably spend quite some time on delineating what the different tags can be and how they relate to each other, and also how they ultimately relate to the business value.
I think the example with the authors who were overpaid is really interesting: just because they were classified into the wrong field, the one they were supposedly writing for, the company ended up with a wrong estimation of how large the target audience was. And especially with LLMs at the moment, we see how much impact the way a field is labeled can have. That an out-of-the-box taxonomy can have such an impact, I would never have expected. But the point that you should first investigate the purpose for which a taxonomy was created is also a good one: the purpose determines the application in the end.
I'm working on a use case right now where we would need a taxonomy or ontology for industries, but we really can't find one yet, so we might have to create our own. So yeah, I think this was a really cool interview, and I learned a lot. I especially love talking to people who are in different fields and have very different perspectives, because that's where we can learn, and probably transfer the most into our own field, because they have a very different way of thinking. What Jessica is doing is really cool. She has a conference coming up soon, so go check her out and stay in touch with her. Otherwise, we will be continuing our series on embeddings next week. And I will see you soon.