Summary
- Webinar with Bill Inman on evolving data and AI trends.
- Compares structured data with text processing and AI limits.
- Emphasizes databases for scale, quality, and visualization.
- Explores business LLM customization and metadata challenges.
Chapters
Welcome to Data Explored two. In this series of webinars, we explore hot trends in the data and AI space.
I'm the host today. I have the pleasure of being the Chief Evangelist at Actian, a role that I'm very happy about. And one of my absolute favorite things to do is to host this webinar series, which is 100% not about Actian, but about very important topics in the data and AI space and the tech space that we can all learn more about and should learn more about. Today I have three guests with me: a primary guest, if you will, and then two panelists who will expand our understanding of the topic at hand.
The first guest is Bill Inman. Bill Inman is an American computer scientist and author of several highly influential books on data architectures and technology. He is the inventor of the data warehouse, and he joins us in that capacity and as the inventor of the textual warehouse, which is the topic that we'll be discussing here today. We also have two panelists who will offer some new perspectives on this topic. One is a chief data architect and the coiner of the content quality management idea,
a term that is highly relevant for AI. Also with us is Jessica Taliman, an independent information architect with her own consultancy and the coiner of the ontology pipeline, which is likewise a term that is highly relevant to AI. And so what we will be discussing today is the textual warehouse that Bill has talked about in several books and in numerous presentations.
And so I will set the stage a bit and then I will enter into a conversation with you, Bill. But before we dive in, I want to say to the audience that there's a Q&A box that you can use for questions. Please use that Q&A box. The format is such that I will have around half an hour of conversation with Bill, then five to ten minutes of conversation with our first panelist and then with Jessica.
And then we open the floor to the questions that you may have in the audience. And so that's it, really. With that, I want to set the stage a little bit by introducing the concept of this talk.
So basically, Bill, you are known as the father of the data warehouse. You coined that term many decades ago, and it was very visionary for its time. It took a long time before the data community adopted
the term and understood what you meant. And I will say very openly that I am a big admirer of such an approach to talking and thinking and educating everyone on tech. I admire terminology that sticks longer than hype.
And I think this is one of the absolute best examples of that. The data warehouse: you fought for that concept through many decades before it was adopted, and now it's simply table stakes. Every single company in the world of a certain size and at a certain level of industrialization has a data warehouse.
That is not something that you discuss.
Now, you have also put forward a new idea, the textual warehouse. It's not that new, actually. You have been talking about it in many books.
And I personally had the pleasure of seeing you present at Data Day Texas, when you presented this idea. And so I think, Bill, that you have done something not once, but twice. I think that the textual warehouse is something that companies in the future will have just as they have data warehouses.
And so that's why I'm interested in exploring the textual warehouse, especially because we're seeing the first movements in that direction now, with the big hype or rise of AI that we have in these years. So with that, I want to get closer to the subject that we'll be discussing, but before I begin asking you questions, I would like to open with a quotation from your book, Bill, and then ask you some very open questions, not about the textual warehouse, but about something else. And you'll see what it is.
So I'm holding up the book Turning Text into Gold. You did a fantastic presentation at Data Day Texas back in January this year on the textual warehouse. And so I will read the beginning of this book, or part of the introduction of this book.
Text is the common fabric in society, you write. Business is transacted in text. Arguments are made in text.
Court cases are conducted in text. Conversations between friends transpire through text. In short text is the medium of exchange between people living on earth since the beginning of computing text justified.
The computer text is simply the original square peg in the round hole. Computer processes focus on structured transactions, not text. For most of its early history, the computer was not much help in dealing with text.
That was a shame as some of the most important information was in the form of text. But today, and this is very much today in our era, but today, there exists advancements in technology that allow the computer to read, store and analyze text. And in doing so, a whole world of informed decision making is possible.
And so with that quote, I would like to open the conversation with a question that really is about text itself. What is it about text that fascinates you, Bill? What is the nature of text?
My whole story with text begins about 23 or 24 years ago. I was at the time working in the world of data warehousing, which is structured data.
And I sat down and asked myself the question: why do corporations only look at what is really a small percentage of their data, which is structured data? Why is text ignored? And that started my journey towards understanding what the issues of text were.
At the time, I had no idea of the complexity that faced me and other people. Everybody takes text for granted because we speak language. But what we don't understand is that in the background, each of our brains is automatically processing thousands of rules all at the same time.
And we don't even think about it. Well, when you start to put text into a computer, you don't have, for the most part, those rules. And that's what makes text so devilishly difficult for the computer.
Now, there are lots and lots of reasons why text is complex. I have to say that in today's world, one of the things that really frustrates me is the attitude of many corporations saying, well, we have text, let's just take ChatGPT, and
ChatGPT solves our problems with text. And indeed ChatGPT does solve a certain set of problems with text. No question about that.
In fact, ChatGPT has opened up doors that have never been opened up before. But in terms of solving the problems with text when it comes to business value, ChatGPT really doesn't do that. And so let's talk a little bit about why. There are actually some very basic reasons why ChatGPT and business value are somewhat divorced.
The first reason is that ChatGPT is text producing text. And for the purposes of ChatGPT, that's just fine.
But for the purposes of doing the analytical processing that we need to do in the corporation, that's not fine. And so, in order to solve many of the problems of business value in the corporation today, we've got to have the information in the form of a database. So what does a database do for you that ChatGPT doesn't do for you?
Well, it does a lot of things. And I'm going to kind of rummage through a list here of things that a database will do for you.
The first and probably the most important reason why a database is more valuable for business value, and by the way, I'm not demeaning ChatGPT.
ChatGPT does some wonderful things for people. It answers all kinds of interesting questions. But in terms of solving business value, it's not a terribly good tool.
Why? Number one, because of volumes of data. If you ask, for example, a doctor how many medical records he or she looks at when they have a problematic patient, the doctor will tell you, well, 20, 25.
And that's because the doctor's got to manually read the records themselves. However, when you are able to take text and put it into the form of a database, now you have an unlimited number of pieces of information you can look at. You can look at 10 million patients.
And certain analysis in medicine absolutely requires that you look at lots and lots of records. There are lots of differences, but the number one difference is in terms of volume, because text has got to be read manually.
And because the database does not have to be read manually, there's a great difference in the volume of data that can be processed. That's number one. Number two is the very basis of the data itself. ChatGPT is good for looking at text found in places like the internet.
In fact, it's extremely good for that purpose. However, for the data that's found in your organization, tucked away in databases, SQL Server databases, Oracle databases, Db2 databases, when data's tucked in there, ChatGPT either can't or has a very difficult time going in and finding that data.
And yet that data in the corporation is the data that is at the heart of servicing your business value. A third reason why a database is so important for analytical processing is that you can visualize the data coming out of a database. You can create a dashboard, you can create a knowledge graph.
You can even just put it into an Excel spreadsheet. But the truth of the matter is that visualization of data is very important for seeing the big picture. I don't know if you've ever tried to take a database or a listing to a manager.
What do the managers of the world do when they see a big pile of information? They ignore it. Managers look at charts.
Managers look at summarizations, and trying to get summarizations and visualizations directly out of ChatGPT is very difficult. Trying to get those visualizations out of a database is very easy to do,
'cause that's what the data is there for. Then there's another topic, and that is the quality of the data itself. For a variety of reasons, and I'm not an expert in ChatGPT, ChatGPT produces these things called hallucinations.
In terms of the reliability of the data and consistency of the data, ChatGPT has a reputation of not doing that very well. The thing about a database that's created from text is that you have 100% certainty that you know the source of the data, and that you can tie every word back to its originating source, so that if there's ever any question about the quality of the data, you have that.
And my friend Shweta, I believe, is more of an expert on this than I am. Another reason, and I'll tell you, I'm going to make this the last reason, but I could go on and on. Another reason why ChatGPT is not particularly good for doing analytical processing is that analysts all the time come along and do what's called iterative or heuristic processing.
They submit a query and they say, oh, that's not quite right. I want to change things a little bit and resubmit the query. They do that, they look at the results, and they say, oh, that's not quite right.
And every time, you've got to go back to your source data, and that costs a lot of machine cycles when you use ChatGPT. However, when you create your database from text, yes, you do have to go back to your source data, but you only have to go back to it once. When the analyst wants to change his or her mind about what they want to ask, you don't have to go back and derive the data from the raw source.
You can simply go back to your database. I've got a long list here, but that's as far as I want to go. Take my word for it.
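As an illustration of the "derive once, query many times" point, here is a minimal sketch using Python's built-in `sqlite3` module. It is not something shown in the webinar: the sample notes, the `term` table, and the function names are all assumptions made up for this example, and real text extraction would be far more involved than splitting on spaces.

```python
import sqlite3

# Hypothetical raw notes standing in for source documents.
RAW_NOTES = [
    ("note-1", "Patient reports chest pain. Prescribed aspirin."),
    ("note-2", "Patient reports headache. Prescribed ibuprofen."),
    ("note-3", "Follow-up: chest pain resolved after aspirin."),
]

def build_database(notes):
    """Read the raw text once and store one row per word with its source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE term (doc_id TEXT, position INTEGER, word TEXT)")
    for doc_id, text in notes:
        words = [w.strip(".,:;").lower() for w in text.split()]
        conn.executemany(
            "INSERT INTO term VALUES (?, ?, ?)",
            [(doc_id, i, w) for i, w in enumerate(words) if w],
        )
    return conn

def docs_mentioning(conn, word):
    """An analyst query: it runs against the database, not the raw text."""
    rows = conn.execute(
        "SELECT DISTINCT doc_id FROM term WHERE word = ? ORDER BY doc_id",
        (word.lower(),),
    )
    return [row[0] for row in rows]

conn = build_database(RAW_NOTES)               # the only pass over the raw text
first_try = docs_mentioning(conn, "aspirin")   # initial question
second_try = docs_mentioning(conn, "headache") # revised question, no re-read
```

Each row keeps the document id and word position, so any answer can be traced back to where it was read.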
There are a lot of other reasons why doing analytical processing with ChatGPT for business value falls short. For lots of other things, looking up what ship Columbus did not sail back to the New World in, what the last name of the first person on the moon was, what the score was in the soccer game between Liverpool and Arsenal last night, ChatGPT does an excellent job, and databases really aren't designed to do that at all. But not when it comes to looking at business value.
And so this is why I get so frustrated with business managers. They think that ChatGPT is a panacea. They think that they simply put ChatGPT over all of their textual data in the corporation, and suddenly wonderful things happen.
And guess what? They don't.
Thank you, Bill, for this fantastic rant.
I prepared a lot of questions, and I expected knowledge and wisdom, but the level of energy and passion, I have to say, surprises me. But obviously I can see that you must be involved in a lot of talks around ChatGPT. That is only natural given the current state of affairs in tech, right? But I want to circle back a little bit to some of the distinctions that you have in your book, and some of the details that you unfold, actually in several of your books.
So first of all, just for the audience here, the vision you have for the enterprise, I think, is perhaps a little different than ChatGPT in general, which is trained on text from the open web, right? Enterprise text, at least as I see it, can be somewhat different. You have a concept that you call boilerplate text.
What's boilerplate text?
Well, I call it the business language model. When you take a look at the progression of language models, if you're trying to look at everything in the world, you need a large language model.
You need to be able to understand anything anybody says. However, when you go into business, you don't need to understand everything everybody says. You need to focus on the business itself: airlines, manufacturing, pharmaceuticals. The language that's contained in each of those businesses is different and pretty much unique to that particular business.
So when you get ready to build your textual warehouse, you don't focus in on the world. I'll tell you something: a true large language model is, I'm going to say, impossible to build. You are never going to be finished, as in, ever.
And furthermore, if you were to finish, it would have changed by the time you finished, and you'd have to go back and redo the thing over again. So a true large language model is an impossibility. However, in terms of the business language models, that's not an impossibility. You can focus on the language of restaurants or whatever industry you want to focus on.
And by focusing on a given business, you now have a task that's finite. The task of dealing with an LLM is infinite, truly infinite. The task of dealing with a business is a challenging task, don't get me wrong.
But the challenging task is still a finite, doable task.
No, I completely agree on that, and I think that opens a lot of very, very interesting possibilities, exactly because of those limitations.
So maybe for the audience, could we just briefly sketch or define what a textual warehouse is, then? What is it?
A textual warehouse contains several elements.
Number one, it contains vocabulary, the vocabulary of whatever enterprise it is that you're going to be looking at. The second thing that it contains is context. Text is fundamentally different from data. When we have the amount of money a bank has loaned this month, that's a well-known piece of information.
We know what the context of that is. But when we see a word that somebody uses, in order for us to understand that word, we have to understand the context. So, number one, your textual warehouse has to contain vocabulary.
Second, it has to contain the context for the vocabulary. Now, context is kind of interesting. There are really two kinds of context.
There's what you would call source context, and there's what you call immediate context. Source context is the context that would normally be associated with a word, say in a dictionary. Immediate context is the context of the word in terms of the text that precedes the word and the text that follows the word.
Because oftentimes, the text immediately preceding and following a word does affect the meaning of the word. So when we talk about context, there are really two kinds of context, source context and immediate context. The next thing that the textual warehouse needs to contain is where the source comes from.
When you were reading your document, where did you get the information from? That's necessary, because if somebody ever has a question about the validity of the interpretation of the word, you can go all the way back to the source itself. Now, there are a lot of other mitigating factors.
One mitigating factor of a textual warehouse is the language itself. As enamored as I am of the English language, because it's my native language, I'm the first person to recognize it's not the only language in the world.
There's German, there's French, there's Japanese, there's Chinese, there's Spanish. If I'm not mistaken, there are about 220 recognized languages on Earth. That's another mitigating factor.
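To make the elements listed so far more concrete, here is a minimal sketch of what one record in a textual warehouse might hold: the vocabulary term, its source context, its immediate context, the source document, and the language. This is an illustration under assumptions, not the design described in the books: the `WarehouseEntry` fields, the tiny `VOCABULARY` table, the alternate-spelling list, and the two-word context window are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class WarehouseEntry:
    word: str               # vocabulary term as it is stored
    source_context: str     # usual meaning, as a dictionary would give it
    immediate_context: str  # the text just before and after the word
    source_document: str    # where the word was read
    position: int           # word offset inside that document
    language: str = "en"
    spelling_variants: list = field(default_factory=list)

# Hypothetical business vocabulary for a bank.
VOCABULARY = {
    "loan": {"source_context": "money lent by a bank", "variants": ["loans"]},
}

def surrounding_text(words, index, window=2):
    """Return the words just before and after position `index`."""
    start = max(0, index - window)
    return " ".join(words[start:index + window + 1])

def build_entries(doc_id, text, language="en"):
    # Map every known spelling to its vocabulary term.
    lookup = {}
    for term, info in VOCABULARY.items():
        lookup[term] = term
        for variant in info["variants"]:
            lookup[variant] = term
    words = text.split()
    entries = []
    for i, raw in enumerate(words):
        term = lookup.get(raw.strip(".,;:").lower())
        if term is None:
            continue  # not part of the business vocabulary
        info = VOCABULARY[term]
        entries.append(WarehouseEntry(
            word=term,
            source_context=info["source_context"],
            immediate_context=surrounding_text(words, i),
            source_document=doc_id,
            position=i,
            language=language,
            spelling_variants=list(info["variants"]),
        ))
    return entries

entries = build_entries("memo-42.txt", "The bank approved the loan after a short review.")
```

Because each entry carries `source_document` and `position`, a questioned interpretation can be checked against the original text.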
And then another factor of the vocabulary is that you've got to be aware of the different spellings of a word, and of how your interpreting agent is going to treat the word. So those are all factors of what would go into a textual warehouse.
Yeah.
Thank you very much for this answer. If you follow this universe closely and have looked into the nature of large language models, I certainly sense quite a different approach in the thinking behind a textual warehouse than behind a large language model, which would be the architecture behind ChatGPT, right?
But I think we have to skip that exact discussion in the interest of time, because we need to move on to at least one particular question that I would love for you to answer, Bill. We could discuss this as a theoretical architecture or something that would be nice to have, but that is not really the case for the textual warehouse. Without mentioning names, could you mention some examples of clients that you have already implemented textual warehouses for, what kind of companies they are, and what the textual warehouse does in those companies?
Sure.
And I'll make this a brief explanation in the interest of time. A while back, we were talking to an oil and gas company. This oil and gas company had many, many oil wells in many places. Each oil well had its own set of documents: documents about pumps, about pipes, about drilling bits, and a whole bunch of other information that each oil well had.
And these were in the form of documents. What happened is that every now and then a vendor of the oil company would come along and say, there's been a recall of a certain kind of pump. And the oil company had a problem.
They said, now we've got to go look through the thousands of documents that we have, and how do we have to look through the documents? We have to look at them manually. And it was a tremendous effort, and a very important effort, to manually go through these documents.
So the intent of the project was to be able to take the contents of the documents and put them into a database, so that now, when a vendor comes along and says, we have some changes, you can look and find your documents in an electronic fashion. It's much like the card catalog in a library. When you go into a library, I don't know, maybe you do, but most people don't go into the library and look at stacks and stacks of books.
Instead, they go to the card catalog, they find what they're looking for in the card catalog, and then take that and go find the books they're looking for. Again, you don't have to do it that way. It's just that that's the way you do it.
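The card catalog idea can be sketched as a small inverted index that maps each term to the documents that mention it. This is only an illustration: the file paths, document texts, pump model numbers, and function names below are invented, and the client project described in the talk is not documented here.

```python
from collections import defaultdict

# Hypothetical well documents; paths and contents are made up.
DOCUMENTS = {
    "well-01/maintenance.txt": "Installed pump model P-200 and new drilling bit.",
    "well-02/inspection.txt": "Pipe corrosion noted. Pump model P-300 in service.",
    "well-03/install.txt": "Pump model P-200 installed with replacement pipes.",
}

def build_catalog(documents):
    """Map each lowercased term to the set of documents containing it."""
    catalog = defaultdict(set)
    for path, text in documents.items():
        for token in text.split():
            catalog[token.strip(".,").lower()].add(path)
    return catalog

def find_documents(catalog, term):
    """Look a term up in the catalog instead of rereading every document."""
    return sorted(catalog.get(term.lower(), set()))

catalog = build_catalog(DOCUMENTS)
recalled = find_documents(catalog, "P-200")  # vendor recalls pump model P-200
```

The catalog is built once; each later recall notice becomes a lookup rather than a manual read through every well's paperwork.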
And so the application of being able to create a document card catalog is one. By the way, there are a lot of other applications. This is only one that I have in mind.
Wonderful. I say this a lot, but for someone like me, with a background in library and information science, it's a really, really interesting time to be alive. I grew up with text and metadata and library systems that got more and more digitized and really deeply connected to the World Wide Web movement:
the invention of RDF, search engine functionality, everything. We see this again now for the enterprise, because of AI, and what you're thinking in the context of text, Bill, is really impressive. It makes me smile and dream, I have to say.
But moreover, we need to transition. So, Shweta, I hope you feel ready. I have interviewed you on the podcast that I host together with the Chief Technical Officer, Emma.
And we talked about your concept, content quality management, which is a concept that I stumbled upon. I see Malcolm on the call. He was also quite early in discovering your concept.
I follow Malcolm's ideas a lot as well. But Shweta, this concept that you have been putting forward in Medium posts and on LinkedIn: how does content quality management connect to the idea of the textual warehouse? Can you educate us on that?
Yeah, absolutely.
I would love to talk about it. And I'm a big fan of Bill, by the way, so I often lose my words when I see him on the screen. Bill has explained very beautifully how a textual warehouse works, right?
And I think it is a breakthrough that finally makes enterprise text accessible, right? It's like a foundation layer to me. It's pulling the text together, standardizing it, and making it actually queryable at scale, right?
And I feel that without that, enterprises are almost blind to most of their own knowledge. Like Bill said, most of the data is unstructured, so they're actually blind to that knowledge, okay? But then here's the real question, right?
To me it is: once I have all of this text in one place, as a textual warehouse, how do I decide what to trust, right? Because not all text is created equal. Some is very clear and some is very reliable.
Other text is contradictory or incomplete, right? I feel that treating it all the same actually confuses the users, both human users like analysts and your LLMs as well, right? So, the first thing I would like to confirm on this call is that content quality management, or CQM, whatever you want to call it,
doesn't replace the textual warehouse. I strongly feel it sits on top of it, right? The warehouse is the stage, right?
It ensures that all of the unstructured text is available, that it is consistent, and that it is queryable, just like Bill was talking about. Okay? But once the curtain rises, the real question is, which part of that text can you trust?
Which can you reuse, or even hand over to the LLMs, right? And that's where content quality management comes into the picture. It works with the textual warehouse, okay?
It's that qualification layer on top of the textual warehouse. Where the textual warehouse is making sure that you have the text, content quality management is making sure that you can now act on it with confidence, right? And this is what also ties into the business outcome part of what Bill was alluding to just a minute ago.
Okay? Now, there's a lot of technical stuff that is required to actually make this happen. Okay?
I won't talk about it here. We could have another session about it for sure. And I'm working on the technicalities of this, right?
But I would love to share why we actually need this layer on top of the textual warehouse, right? To answer that question for myself, I picked up two things that I felt are very related to this topic, and I think Bill covered it so beautifully as well. Okay?
So two things. One is, if you haven't read them, you should actually read OpenAI's terms of use, okay? They explicitly say that you are responsible for the content, including ensuring that it does not violate any applicable law or these terms, right?
In other words, the burden of input quality sits with the enterprise, not with the model. It sits with you, who is actually creating this data. Whether it is in the textual warehouse format, or in the knowledge graph format, or whatever you are creating, it's actually up to you to make sure that input quality is maintained.
Okay? And number two, which I want to talk about, is a very recent study that came out from KAIST, K-A-I-S-T. Okay?
It looked at why people get frustrated with ChatGPT, and Bill was so right about this, right? The top causes, as per that paper and that study, were the model missing the intent, and inaccurate responses as well.
And the striking part was that in 72% of the cases, users couldn't fix it by re-prompting. They couldn't fix that. It's not a failure of the model itself.
It's a reflection of what it was fed, right? So the input was actually the problem and the cause of it, right? So if the content isn't qualified upfront, the AI cannot magically repair it.
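The idea of qualifying content before it reaches a model can be sketched as a simple filter. The talk does not describe how content quality management is implemented, so everything here is an assumption for illustration only: the `TextRecord` fields, the three checks (known source, completeness, no recorded contradictions), and the sample records are invented.

```python
from dataclasses import dataclass

@dataclass
class TextRecord:
    doc_id: str
    text: str
    has_source: bool        # can the text be traced to an origin?
    is_complete: bool       # hypothetical completeness flag
    contradicts: tuple = () # ids of records this one conflicts with

def qualify(records):
    """Split records into text considered trusted and text held back."""
    trusted, held_back = [], []
    for record in records:
        if record.has_source and record.is_complete and not record.contradicts:
            trusted.append(record)
        else:
            held_back.append(record)
    return trusted, held_back

records = [
    TextRecord("manual-v2", "Replace the seal every 12 months.", True, True),
    TextRecord("email-17", "Seal interval is probably 6 months?", False, True,
               contradicts=("manual-v2",)),
    TextRecord("manual-v1", "Replace the seal every", True, False),
]

trusted, held_back = qualify(records)
trusted_ids = [r.doc_id for r in trusted]
held_back_ids = [r.doc_id for r in held_back]
```

Only the `trusted` list would be passed on to an analyst or a model; the held-back records stay in the warehouse for review rather than being deleted.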
And that's what Bill was also trying to tell you: it is good at some stuff, but not at everything, right? And this is where content quality management comes in, okay? So if the textual warehouse puts you inside the Matrix, with streams of text flowing everywhere, then I would call CQM your Neo, right?
It sees the signal in the noise, it chooses what's real, and it delivers that usable, trustworthy input, okay? So in my opinion, the connection is very clear. Once you have your text in the textual warehouse, the governance layer, the content quality management layer, is going to tell you, from the governance perspective, which text can be trusted as reliable, and which text you can use with more confidence to make your LLM responses hallucinate less, right?
So that's the connection between the textual warehouse, as what you have as text, and what you can have as trusted text.
Yeah. So, thank you.
Shweta, how content quality is connected to the textual warehouse was, in my opinion, very clearly laid out. I have more questions for you, but in the interest of time, because we both need to hear out, I was about to call you my colleague, Jessica, since we have had you on so many times, and then we also have some brilliant questions in the Q&A box. So I would love to get to some of those too, but that is not to make you feel that you should rush through your concept.
Jessica, I was in the room when the ontology pipeline as an idea was born. And I think it connects very, very nicely with the idea of the textual warehouse as well. I think that, overall, content quality management, the ontology pipeline, and the textual warehouse are ideas that give us a firmer understanding of how we can proceed towards managing unstructured data for AI, right?
So Jessica, please elaborate and help us understand what the ontology pipeline is, and how it connects to these ideas and to the idea of the textual warehouse in particular.
So, thanks Ola. The ontology pipeline really was a riff, or is a riff, on the semantic spectrum from the semantic web.
But it also codifies processes from library science for structuring vocabularies and context and meaning. So, very much like the card catalog that Bill alluded to, which I obviously have an affinity for, being a librarian or a library and information science person, it starts with a controlled vocabulary. The idea is how to structure that controlled vocabulary.
And the textual warehouse does that beautifully. From the controlled vocabulary, we add structure. There are certain parts of the pipeline that are somewhat interchangeable, but the idea is iterative steps and stages of maturity for contextual vocabularies. And so from the controlled vocabulary, we look at building a taxonomy, which is a hierarchy.
From the hierarchy, we go to a thesaurus. The thesaurus extends the taxonomy to have relationships. There are definitions and meaning. And then from there we go to metadata schemas.
Obviously, metadata schemas can flip to a different part of the pipeline. That's the one piece of flexibility. And then ontologies, which add context and meaning.
So it's the encoding structure, and then obviously knowledge graphs. It's an iterative process that helps to guide people, and it's measurable, and that's also very important. It relates to the textual warehouse in that the textual warehouse can surface and help to co-locate vocabulary, context, and meaning. The ontology pipeline can help to guide teams in determining definitions, for example reconciling acronyms with terminology, and encoding that in a way that lets us connect a concept not only with a definition, but with a link to an authority source that validates that concept's meaning, and that helps to codify the existence of that concept and its relationship to other things
within, for example, the textual warehouse.
Wonderful. Wonderfully laid out, and very succinct.
Thank you, Jessica. I get also that this was very briefly explained, so people may have questions, but in the interest of time, is it okay that we jump to the Q&A now?
Because I see a lot of questions that I think we should answer. I do want to mention, though, Jessica, that I just saw today that your post on the ontology pipeline on LinkedIn had more than 800 likes. I'm incredibly impressed by that. It's a very clear concept, and I love it.
So it's very well deserved. I know we can expect more on the ontology pipeline. And it's interesting.
I do wanna add, what's, what's nice is I have talked to people that have implemented, um, the ontology pipeline. And it's not that you have to implement the entire pipeline. Some people working up to just taxonomies is enough, but you at least have a vision of an end game, should you choose, um, to take that opportunity.
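As one way to picture the stages Jessica describes, here is a minimal Python sketch. It is not her implementation: the `Concept` class, its field names, the `stage_reached` and `to_triples` helpers, and the example URL are all assumptions made for illustration. The sketch covers only the first three stages (controlled vocabulary, taxonomy, thesaurus) and then flattens the records into knowledge-graph-style triples; metadata schemas and a full ontology are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Concept:
    label: str                                           # controlled-vocabulary term
    broader: Optional[str] = None                        # taxonomy: parent term
    definition: str = ""                                 # thesaurus: meaning
    alt_labels: List[str] = field(default_factory=list)  # thesaurus: acronyms, synonyms
    authority: str = ""                                  # link to an authority source


def stage_reached(concepts: List[Concept]) -> str:
    """Report the furthest stage the vocabulary has reached (simplified)."""
    if not any(c.broader for c in concepts):
        return "controlled vocabulary"
    if not any(c.definition or c.alt_labels for c in concepts):
        return "taxonomy"
    return "thesaurus"


def to_triples(concepts: List[Concept]) -> List[Tuple[str, str, str]]:
    """Flatten concepts into (subject, predicate, object) triples."""
    triples = []
    for c in concepts:
        if c.broader:
            triples.append((c.label, "broader", c.broader))
        for alt in c.alt_labels:
            triples.append((c.label, "altLabel", alt))
        if c.definition:
            triples.append((c.label, "definition", c.definition))
        if c.authority:
            triples.append((c.label, "definedBy", c.authority))
    return triples


# Hypothetical banking vocabulary, maturing one stage at a time.
vocab = [Concept("loan"), Concept("mortgage", broader="loan")]
print(stage_reached(vocab))
vocab[1].alt_labels.append("home loan")
vocab[1].authority = "https://example.org/glossary/mortgage"  # placeholder URL
print(stage_reached(vocab))
print(to_triples(vocab))
```

Because each stage only adds fields to the same records, a team can stop at the taxonomy stage, as Jessica notes, without reworking anything if it later decides to go further.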
Yeah. And that is like, no, I love it, but we will talk more about it. Um, okay, the q and a, I will go, uh, I will proceed with the questions in a chron, uh, chronological order.
But Kan has a question, and I guess it's for you, Bill. Are you proposing that we have an enterprise data warehouse and a textual data warehouse, or textual warehouse? I guess it should be combined.
So are you thinking of combining data warehouses and textual warehouses? Yes, absolutely.
Okay, I hate this answer. The answer is yes and no.
Yes, you can combine them, and it makes a lot of sense to, but do you have to combine them? No, you don't. And so again, I hate a wishy-washy answer, but the answer is yes and no.
Whatever makes the most business sense for your organization. I think, to not sound wishy-washy, Bill, maybe you could just say that these are decoupled concepts. Yes, yes. That makes it sound more intentional, which I think it is, honestly, to defend you. How do you compare text stores to NoSQL technologies and content stores?
A question from Paul; I guess it's also to you, Bill. How do you compare? I'm familiar with technologies and content stores. Come again?
Can we hear, did we hear the question? Can you hear me? No.
Yeah, I hate to, but I'm not going to answer the question because I don't know enough about the subject to render an opinion. So I'm going to have to pass on this one.
Yeah, sure. No problem. Love the honesty.
Okay. A question about layered architecture. If I may? Yeah, sure. Maybe to try to answer: it's a key-value pair where most of the NoSQL technologies come in, right? It's mostly from that perspective.
So what Bill is alluding to is a proper warehouse that has more than key-value pairs, right? So that could be a probable answer here. Better. Thank you.
Thank you. The next one actually ties a little bit into what you added to the textual warehouse thinking, Shweta. Again, Kan asking: will the textual warehouse have a similar architecture, with lake, warehouse, curated layers, and so forth?
So obviously, I guess, Shweta, your point would be yes, it's a layered architecture.
Do you agree, though, on that? Yes, I do agree. Yeah.
Whatever you come up with, this would be sitting on top of it. It's just like your data quality piece, right? When do you have the data quality?
Once you have your data ready, you actually have your data quality pipeline running on it, right? So that's the same thing for content quality management. Once you have your data ready, in any format, okay?
You should be able to run the content quality management, like a module or a feature, to give you only the text that is required to answer that particular question, right? Not all the text. Yeah.
Bill, did you feel like chipping in, or should I just take the next question? No, let's go to the next question. Okay.
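To make the idea of "only the text that is required to answer that particular question" concrete, here is a small Python sketch. It is an assumption-laden illustration, not the content quality management approach itself: the `relevant_passages` function, the stopword list, and the word-overlap scoring are all invented here as a simple stand-in for whatever relevance logic a real module would use.

```python
import re

# Hypothetical stopword list; a real module would use something richer.
STOPWORDS = {"the", "a", "an", "is", "of", "to", "in", "and", "what", "how", "for", "on"}


def tokens(text):
    """Lowercase word set with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def relevant_passages(question, passages, min_overlap=1):
    """Keep only passages sharing at least `min_overlap` words with the question,
    ordered from most to least overlap."""
    q = tokens(question)
    scored = [(len(q & tokens(p)), p) for p in passages]
    kept = [(score, p) for score, p in scored if score >= min_overlap]
    kept.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps input order on ties
    return [p for _, p in kept]


passages = [
    "The mortgage rate is fixed for five years.",
    "Our office is closed on Sundays.",
    "Late fees apply to overdue mortgage payments.",
]
print(relevant_passages("What is the mortgage rate?", passages))
```

The filter runs on top of text that is already prepared, in whatever layer it lives, which mirrors the point that this step sits above the storage architecture and does not replace it.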
Okay. It's from Ramona, one of my very rare dear readers and a friend that I have never talked to, but really a great, great person that I suggest you connect with on LinkedIn and Substack. So Ramona asks this question, and it's to you, Bill. I will find one for you, Jessica, also.
So this one is for you, Bill. I see alignment between Bill's textual warehouse and small language models. I was thinking the same thing, actually, when you went over the explanation, so on a very specific business domain.
So one question I have in both contexts is, how is the tribal knowledge captured? Maybe, Bill, you can answer in the context of the textual warehouse. Okay?
How is tribal knowledge captured? This is actually a very complex question, so I'm going to try to give you a short, succinct answer, but I'm going to tell you it's not the complete answer. The truth of the matter is, when you're building your taxonomies, your business language model, you end up focusing on normally used words. Let's take banking: a word that somebody from Bank of America, Citicorp, JP Morgan, and Wells Fargo would all understand, that's what goes into your business language model.
However, every corporation, in fact every person in this world, has their own small amount of private vocabulary, things that you say that nobody else would say.
So when you build your business language model, you've got to build it so that it's able to easily be modified and added to, because nobody can build a business language model and include all of the customization itself. So we recognize that there has to be customization. And the best answer is, when you go down to a particular organization, you find the customized vocabulary and quickly insert it into your business language model.
Now, I've given you a very high-level answer. If you're interested in seeing how this actually works, I'd be happy to demonstrate it for you. But this is actually a complex question.
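The layering Bill describes, a shared industry vocabulary with an organization's private terms inserted on top, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `INDUSTRY_TERMS`, `build_model`, `classify`, and the sample terms are hypothetical and do not represent how his tooling works.

```python
from collections import ChainMap

# Hypothetical shared banking vocabulary: words any bank would understand.
INDUSTRY_TERMS = {
    "loan": "credit product",
    "deposit": "account activity",
    "apr": "pricing",
}


def build_model(custom_terms):
    """Layer an organization's private vocabulary over the shared one.
    The custom layer is searched first, and new entries are written to it."""
    custom = {term.lower(): meaning for term, meaning in custom_terms.items()}
    return ChainMap(custom, INDUSTRY_TERMS)


def classify(text, model):
    """Return (word, classification) for every word the model recognizes."""
    return [(w, model[w]) for w in text.lower().split() if w in model]


# One organization's tribal term, added without touching the shared layer.
model = build_model({"bluebook": "internal credit policy"})
print(classify("Check the bluebook before approving the loan", model))

# Quick insertion of another private term later on.
model["redline"] = "declined application"
```

Keeping the private terms in their own layer means the shared vocabulary stays reusable across organizations while each one can add to its own copy at any time.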
Yes, it's indeed a complex question. But it's fantastic, if I may say on a side note, that the global data and AI community is so well connected that we can have such a webinar with people participating from all over the world, from Japan to Europe to the US. So that's great, at least, even though we don't have time for complicated questions. Paul, and maybe this one is for you, Jessica; I'll ask you about this one.
Paul also asks: how is master data management applied for business context, if more focused on business language models versus leveraging generic LLM ideas? Obviously, it's a question for you, Bill, but I'll try to let Jessica answer this one. Well, it's interesting.
I'm writing a series of articles right now about metadata, specifically looking at master data management and systems that we build that try to achieve some amount of, I guess, control or source of truth. And personally, and this may be controversial, I do see MDM as somewhat limited because of the concept of a golden record and the idea that there's one language or one way of describing something to rule them all. So in fact, right now I think that many of us are trying to figure out, okay, we have these two concepts in front of us.
We have the semantic layer (I feel like I just said a bad word) and master data management. And those can sometimes be very different and disparate processes. So the idea is to create a super flexible model, and that's what ontologies do for us: create a flexible model for describing these things that accommodates more than that one perfect way of saying something or capturing something.
Because the reality in businesses, as you cover, Ole, in your book, Fundamentals of Metadata Management, is that the social aspect of managing data and metadata and structuring data is very difficult to do successfully, if not impossible. And so the idea is to be able to accommodate and structure things using ontologies so that we're able to take a concept and capture all the nuances from within a business for how that concept is described, and do it well for both humans and machines. So you have the text literal view, and then you have the backend view that's able to make a very machine-readable, interoperable structure of that concept.
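Jessica's contrast between a single golden record and a concept that carries every team's way of naming it can be illustrated with a short Python sketch. Everything here is hypothetical: the `BusinessConcept` class, the `build_index` helper, the identifier scheme, and the team names are assumptions chosen for the example, not a description of any MDM or ontology product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BusinessConcept:
    concept_id: str                 # machine-readable identifier (backend view)
    definition: str                 # shared meaning
    labels_by_team: Dict[str, str] = field(default_factory=dict)  # human-facing names


def build_index(concepts: List[BusinessConcept]) -> Dict[str, str]:
    """Map every team's label to the one concept it refers to."""
    index = {}
    for concept in concepts:
        for label in concept.labels_by_team.values():
            index[label.lower()] = concept.concept_id
    return index


customer = BusinessConcept(
    concept_id="C-001",
    definition="A party that buys goods or services from the company.",
    labels_by_team={"sales": "Account", "support": "Client", "finance": "Debtor"},
)

index = build_index([customer])
print(index["client"], index["debtor"])  # both labels resolve to the same concept
```

No label is declared the winner: each team keeps its own wording, and the identifier gives machines one stable thing to link to.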
Very clear. Thank you. Thank you.
We have time for just a couple more questions.
Aash asks: how is data stored, or the schema designed, in a textual data warehouse? In star schemas, data is stored in the form of dimensions, facts, or maybe data marts for reporting purposes. Also, how is data accessed by the end users?
Can I know more about the architecture of the warehouse? Okay, for you, Bill? Yeah. Okay.
One more time, this is a very complex question. I'm going to try to give you the quickest, best answer that I can.
When we went to design something called textual ETL, we knew that, first off, we had to have a single physical format for data, number one. You know, I hate to say this. I would love to answer the question properly.
I just don't have time to go into the nuances. So I'm frustrated, because I do have a good answer for you. I'd love to tell you. The structure of data is fundamentally different in a textual warehouse than it is in a data warehouse.
In a data warehouse, the metadata describes the data in the column. In a textual warehouse, the metadata describes the data in a row.
And again, I know that this is not a good explanation, but that's the best I can do under the circumstances. I think I will take you hostage, Bill, and say that we have an excellent blog at actian.com, where you can elaborate your point for us. If you wouldn't mind? I'd be happy to.
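Bill did not have time to elaborate, so the following Python sketch is only one possible reading of his column-versus-row contrast, not his design. The `TextRow` fields (`source`, `offset`, `text`, `classification`), the `to_rows` function, and the sample taxonomy are all assumptions. The idea shown is that a data warehouse row is interpreted through its column definitions, while each row in this textual format carries its own descriptors.

```python
import re
from typing import Dict, List, NamedTuple

# Data warehouse style: the column names tell you what each value means.
sales_columns = ("customer_id", "amount")
sales_rows = [(101, 250.0), (102, 75.5)]


# Textual warehouse sketch: one physical format, and each row describes itself.
class TextRow(NamedTuple):
    source: str          # document the text came from
    offset: int          # character position within that document
    text: str            # the word that was found
    classification: str  # what the taxonomy says this word is


def to_rows(source: str, text: str, taxonomy: Dict[str, str]) -> List[TextRow]:
    """Emit one self-describing row per word that the taxonomy recognizes."""
    rows = []
    for match in re.finditer(r"\w+", text):
        word = match.group().lower()
        if word in taxonomy:
            rows.append(TextRow(source, match.start(), word, taxonomy[word]))
    return rows


taxonomy = {"mortgage": "loan product", "refund": "customer request"}  # hypothetical
for row in to_rows("call_17.txt", "Customer asked about mortgage rates", taxonomy):
    print(row)
```

Under this reading, text from any source lands in the same row shape, which is consistent with the requirement for a single physical format that Bill mentions.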
Thank you, Bill. I won't forget that, Bill. Okay.
So I'll reach out about that. Jono, my good friend Jono, has a question. Oh, I guess we're running out of time, but let's see: what would the typical dimensions of a textual warehouse look like?
Trying to connect them to the ones we might find in a classical data warehouse. Well, it's a little bit the same question, I guess, isn't it, Bill? Yeah, it is.
Yeah. Okay. So there will be a blog post about this.
I just hijacked Bill Inmon to write a blog post on it. Okay, I'll be happy to. Thank you.
I know you're busy, so please, if it's possible, it would be really nice. Thank you. But Ramona, the last question.
Yes, Aash, we will have that blog post. I will tag you, don't worry. Okay.
And Ramona has the last question. As an aside, what Bill defined as immediate context is how a small language model is trained. I guess that's a comment more than a question.
Yeah. And by that, we conclude this Data Explored. This is a webinar series where we explore hot trends and topics in the data and AI community globally with authors, thought leaders, and people that are strategists, architects, and leaders in big companies.
Today we discussed the reclaiming of unstructured data, as we call it: the textual warehouse, and what it can do for text in the era of AI. We did this on the basis of Turning Text into Gold, and actually also The Textual Warehouse, which I have also read and really like. We interviewed you, Bill Inmon; thank you very much for coming, Bill. And also, Jessica and Shweta, thank you for being on our panel as experts that could contextualize this topic even more.
So thank you, Bill, Jessica, and all of you. My pleasure. Thank you so much.
Thank you. Thank you everyone. Thank you.
Thank you. Bye. Take care.